How can one determine the number of distinct binary search trees on the key set 1,...,n?
I guess you are asking for the total number of binary search trees possible with n distinct keys; that is the nth Catalan number, (2n)! / ((n+1)! * n!).
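For example, a quick sketch in R (illustrative, not part of the original answer) that evaluates the closed form and cross-checks it against the standard recurrence T(n) = sum over the root i of T(i-1) * T(n-i), with T(0) = 1:

catalan <- function(n) choose(2 * n, n) / (n + 1)    # closed form: C(2n, n) / (n + 1)

count_bsts <- function(n) {                          # dynamic programming over the recurrence
  t <- c(1, rep(0, n))                               # t[k + 1] holds T(k)
  for (m in 1:n)
    for (i in 1:m)
      t[m + 1] <- t[m + 1] + t[i] * t[m - i + 1]     # root i: i - 1 keys left, m - i keys right
  t[n + 1]
}

catalan(5)     # 42
count_bsts(5)  # 42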
As far as I understand, one of the main purposes of the LSH method is data reduction beyond what the underlying hashes (often minhashes) already provide. I have been using the textreuse package in R, and I am surprised by the size of the data it generates. textreuse is a peer-reviewed rOpenSci package, so I assume it does its job correctly, but my question persists.
Let's say I use 256 permutations and 64 bands for my minhash and LSH functions respectively -- realistic values that are often used to detect with relative certainty (~98%) similarities as low as 50%.
If I hash a random text file using TextReuseTextDocument (256 perms) and assign it to trtd, I will have:
object.size(trtd$minhashes)
> 1072 bytes
Now let's create the LSH buckets for this object (64 bands) and assign the result to l. I will have:
object.size(l$buckets)
> 6704 bytes
So, the retained hashes in the LSH buckets are six times larger than the original minhashes. I understand this happens because textreuse uses an MD5 digest to create the bucket hashes.
But isn't this wasteful, even overkill, and can't I improve it? Is it normal for our data reduction technique to end up bloating the data to this extent? And isn't it more effective to match the documents based on the original hashes (similar to perms = 256 and bands = 256) and then use a threshold to weed out the false positives?
Note that I have reviewed the typical texts such as Mining of Massive Datasets, but this question remains about this particular implementation. Also note that the question is not only out of curiosity, but rather out of need. When you have millions or billions of hashes, these differences become significant.
Package author here. Yes, it would be wasteful to use more hashes/bands than you need. (Though keep in mind we are talking about kilobytes here, which could be much smaller than the original documents.)
The question is, what do you need? If you need to find only matches that are close to identical (i.e., with a Jaccard score close to 1.0), then you don't need a particularly sensitive search. If, however, you need to reliably detect potential matches that share only a partial overlap (i.e., with a Jaccard score closer to 0), then you need more hashes/bands.
Since you've read MMD, you can look up the equation there. But there are two functions in the package, documented here, which can help you calculate how many hashes/bands you need. lsh_threshold() will calculate the threshold Jaccard score that will be detected, while lsh_probability() will tell you how likely it is that a pair of documents with a given Jaccard score will be detected. Play around with those two functions until you get the number of hashes/bands that is optimal for your search problem.
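For reference, the equation from MMD that these helpers build on: with h hashes split into b bands of r = h/b rows each, a pair of documents with Jaccard similarity s becomes a candidate with probability 1 - (1 - s^r)^b. A quick base-R sketch (the function name is mine, independent of the package's own helpers):

lsh_prob <- function(s, h, b) {   # probability that a pair with Jaccard similarity s
  r <- h / b                      # shares at least one of the b bands (r rows per band)
  1 - (1 - s^r)^b
}

lsh_prob(s = 0.50, h = 256, b = 64)  # ~0.98: the "relative certainty" for 50% similarity
lsh_prob(s = 0.25, h = 256, b = 64)  # ~0.22: much less similar pairs are mostly filtered out

Playing with h and b in this formula (or with the package's own helpers) is exactly the tuning step described above.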
Let's say I have a million documents that I preprocessed (calculated minhash signatures for) in O(D*sqrt(D)) time, where D is the number of documents. When I'm given a query document, I have to return, in O(sqrt(D)) time, the first of the million preprocessed documents whose Jaccard similarity with the query is greater than or equal to, say, 0.8.
If there is no document similar enough to the query document to reach that score, I have to return a document with similarity at least c * 0.8 (where c < 1) with probability at least 1 - 1/e^2. How may I find the maximum value of c for this minhash scheme?
Your orders of complexity/time don't sound right. Calculating the minhashes (signature) for a document should be roughly O(n), where n is the number of features (e.g., words, or shingles).
Finding all similar documents to a given document (with estimated similarity above a given threshold) should be roughly O(log(n)), where n is the number of candidate documents.
A document with an (estimated) Jaccard similarity of at least 0.8 will have at least 80% of its minhashes matching the given document. You haven't defined c and e for us, so I can't tell what your minimum threshold is -- I'll leave that to you -- but you can easily achieve this efficiently in a single pass:
Work through all your base document's hashes one by one. For each hash, look in your hash dictionary for all other docs that share that hash. Keep a tally for each document found of how many hashes it shares. As soon as one of these tallies reaches 80% of the total number of hashes, you have found the winning document and can halt calculations. But if none of the tallies ever reaches the 0.8 threshold, then continue to the end. Then you can choose the document with the highest tally and decide whether it passes your minimum threshold.
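A rough sketch of that single pass in R; hash_to_docs and all other names here are hypothetical, assuming you have already built an index mapping each minhash value to the IDs of the documents that contain it:

find_match <- function(query_hashes, hash_to_docs, threshold = 0.8) {
  needed <- ceiling(threshold * length(query_hashes))
  tally <- integer(0)                          # named vector: doc ID -> number of shared hashes
  for (h in as.character(query_hashes)) {
    for (doc in hash_to_docs[[h]]) {           # all docs sharing this hash (NULL if none)
      tally[doc] <- if (is.na(tally[doc])) 1L else tally[doc] + 1L
      if (tally[doc] >= needed) return(doc)    # early exit: estimated Jaccard >= threshold
    }
  }
  if (length(tally) == 0) return(NULL)         # no candidate shared any hash
  best <- which.max(tally)                     # otherwise, best candidate for the fallback check
  list(doc = names(tally)[best], shared = unname(tally[best]))
}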
In information retrieval evaluation, what would precision@k be if fewer than k documents are retrieved? Let's say only 5 documents were retrieved, of which 3 are relevant. Would precision@10 be 3/10 or 3/5?
It can be hard to find text defining edge cases of measures like this, and the mathematical formulations often don't deal with the incompleteness of data. For issues like this, I tend to defer to the decisions made by trec_eval, a tool distributed by NIST that implements all the common retrieval measures, especially those used in the Text REtrieval Conference (TREC) challenges.
Per the metric description in m_P.c of trec_eval 9.0 (called the latest on this page):
Precision measured at various doc level cutoffs in the ranking.
If the cutoff is larger than the number of docs retrieved, then
it is assumed nonrelevant docs fill in the rest. Eg, if a method
retrieves 15 docs of which 4 are relevant, then P20 is 0.2 (4/20).
Precision is a very nice user oriented measure, and a good comparison
number for a single topic, but it does not average well. For example,
P20 has very different expected characteristics if there are 300
total relevant docs for a topic as opposed to 10.
This means that you should always divide by k, even if fewer than k documents were retrieved, so the precision would be 0.3 rather than 0.6 in your particular case. (The system is penalized for retrieving fewer than k.)
The other tricky case is when there are fewer than k relevant documents. This is why they note that precision is a helpful measure but does not average well.
Some measures that are more robust to these issues are Normalized Discounted Cumulative Gain (NDCG), which compares the ranking to an ideal ranking (at a cutoff), and the simpler R-Precision, which calculates precision at the number of relevant documents rather than at a fixed k. So one query may compute P@15 for R=15, and another may compute P@200 for R=200.
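To make the convention concrete, here is a small hypothetical helper in R (not part of trec_eval) that always divides by the cutoff, with R-Precision for comparison:

precision_at_k <- function(retrieved, relevant, k) {
  sum(head(retrieved, k) %in% relevant) / k    # missing ranks count as nonrelevant
}

r_precision <- function(retrieved, relevant) { # precision at R = number of relevant docs
  precision_at_k(retrieved, relevant, length(relevant))
}

retrieved <- c("d3", "d7", "d1", "d9", "d4")   # only 5 documents retrieved
relevant  <- c("d1", "d3", "d9")               # 3 of them are relevant
precision_at_k(retrieved, relevant, 10)        # 0.3, i.e. 3/10 rather than 3/5
r_precision(retrieved, relevant)               # 2/3: two of the top R = 3 are relevant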
I have read a few solutions about nearest neighbor search in high dimensions using random hyperplanes, but I am still confused about how the buckets work. I have 100 million documents in the form of 100-dimensional vectors and 1 million queries. For each query, I need to find the nearest neighbor based on cosine similarity. The brute force approach is to compute the cosine similarity of the query with all 100 million documents and select the ones with a value close to 1. I am struggling with the concept of random hyperplanes with which I can put the documents into buckets, so that I don't have to compute the cosine similarity 100 million times for each query.
Think of it geometrically. Imagine your data as points in a high-dimensional space.
Create random hyperplanes (just planes in a higher dimension); to build intuition, picture the reduced 2D or 3D case.
These hyperplanes cut your data (the points) into partitions, so that some points end up positioned apart from others (every point lands in one partition; this is a rough approximation of proximity).
Now the buckets are populated according to the partitions formed by the hyperplanes. As a result, every bucket contains far fewer points than the whole point set (because each partition contains only a fraction of the points).
As a consequence, when you pose a query, you check far fewer points (with the help of the buckets) than the total. That's all the gain here: checking fewer points means you do much better (faster) than the brute force approach, which checks all the points.
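Here is a minimal sketch in R of that random-hyperplane (SimHash-style) bucketing, with toy sizes and made-up names; in practice you would tune the number of planes and typically use several independent hash tables:

set.seed(42)
d <- 100         # dimensionality of the document vectors
n_planes <- 8    # number of random hyperplanes, i.e. up to 2^8 = 256 buckets

planes <- matrix(rnorm(n_planes * d), nrow = n_planes)   # each row is a hyperplane normal

bucket_key <- function(v, planes) {
  bits <- as.integer(planes %*% v >= 0)   # which side of each hyperplane the point falls on
  paste(bits, collapse = "")              # the bit pattern is the bucket id
}

docs <- matrix(rnorm(1000 * d), nrow = 1000)    # toy stand-in for the 100 million vectors
keys <- apply(docs, 1, bucket_key, planes = planes)
buckets <- split(seq_len(nrow(docs)), keys)     # bucket id -> indices of the docs in it

q <- rnorm(d)                                   # a query vector
candidates <- buckets[[bucket_key(q, planes)]]  # NULL if the query's bucket happens to be empty
# Only these candidates get an exact cosine-similarity check, instead of all 100 million.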
I have managed to evaluate the tf-idf function for a given corpus. How can I find the stopwords and the best words for each document? I understand that a low tf-idf for a given word and document means that it is not a good word for selecting that document.
Stop words are those words that appear very commonly across the documents, therefore losing their discriminative power. The best way to observe this is to measure the number of documents each term appears in and filter out those that appear in more than 50% of them, or the 500 most frequent, or some other threshold that you will have to tune.
The best (as in most representative) terms in a document are those with the highest tf-idf, because those terms are common in the document while being rare in the collection.
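A rough sketch of both ideas in R, on a toy corpus (all names and thresholds here are purely illustrative):

docs <- list(
  d1 = c("the", "cat", "sat", "on", "the", "mat"),
  d2 = c("the", "dog", "chased", "a", "ball"),
  d3 = c("the", "stock", "market", "fell")
)
N <- length(docs)

df <- table(unlist(lapply(docs, unique)))   # document frequency of each term

stop_words <- names(df[df / N > 0.5])       # terms in more than 50% of documents (threshold to tune)

idf <- log(N / df)
top_terms <- lapply(docs, function(words) { # highest tf-idf terms per document
  tf <- table(words)
  sort(tf * idf[names(tf)], decreasing = TRUE)
})

stop_words    # "the" is the only term appearing in more than half of the documents
top_terms$d3  # "fell", "market", "stock" score highest; "the" scores 0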
As a quick note, as @Kevin pointed out, very common terms in the collection (i.e., stop words) get very low tf-idf anyway. However, they still affect some computations, which may be undesirable if you consider them pure noise (which might not be true, depending on the task). In addition, including them makes your algorithm slightly slower.
edit:
As @FelipeHammel says, you can directly use the IDF (remember to invert the ordering), since it is inversely related to df. This is completely equivalent for ranking purposes, and therefore for selecting the top k terms. However, it does not directly let you select based on ratios (e.g., words that appear in more than 50% of the documents), although a simple threshold fixes that (i.e., selecting terms with idf lower than a specific value). In general, a fixed number of terms is used.
I hope this helps.
From "Introduction to Information Retrieval" book:
tf-idf assigns to term t a weight in document d that is
highest when t occurs many times within a small number of documents (thus lending high discriminating power to those documents);
lower when the term occurs fewer times in a document, or occurs in many documents (thus offering a less pronounced relevance signal);
lowest when the term occurs in virtually all documents.
So words with the lowest tf-idf can be considered stop words.