How often should I execute LDA for the whole document corpus? - information-retrieval

Let's assume we have a moderately growing document corpus, i.e. some new documents are added to it every day. For these newly added documents, I can infer topic distributions using only the inference part of LDA; I do not have to run the whole topic estimation + inference process on all documents again just to get topic distributions for the new ones. Over time, however, I might need to redo the whole topic generation process, since the documents added since the last LDA run may introduce entirely new words to the corpus.
Now, the question I have is: how do I determine a good interval between two topic-generation runs? Are there any general recommendations on how often we should run LDA on the whole document corpus?
If I keep the interval very short, the topic distributions will keep changing and never stabilize. If I keep it too long, I might miss new topics and new topic structures.
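(For concreteness, the inference-only step looks roughly like this, assuming the model was trained with gensim; file names and tokens below are purely illustrative.)

# Minimal sketch: infer topics for a new document without retraining,
# assuming an existing gensim LdaModel and its Dictionary (names are illustrative).
from gensim import corpora, models

dictionary = corpora.Dictionary.load("lda.dictionary")  # built during the last full run
lda = models.LdaModel.load("lda.model")                 # trained during the last full run

new_doc_tokens = ["some", "new", "document", "tokens"]  # pre-tokenized new document
bow = dictionary.doc2bow(new_doc_tokens)                # words unseen so far are simply dropped

# Topic distribution for the new document, with no re-estimation of the model
topic_dist = lda.get_document_topics(bow, minimum_probability=0.0)
print(topic_dist)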

I'm just thinking aloud here... One very simple idea is to sample a subset of documents from the bunch of newly added documents (say over a period of one day).
You could possibly extract key words from each of these documents in the sampled set, and execute each as a query to the index built from a version of the collection that existed before adding these new documents.
You could then measure the average cosine similarity between each query and its top-K retrieved documents, and average those values over all queries in the sample. If this average similarity falls below a pre-defined threshold, it suggests that the new documents are not very similar to the existing ones, and it might then be a good idea to rerun LDA on the whole collection.
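To make that concrete, here is a rough sketch of such a check using TF-IDF vectors and scikit-learn; the sample size, top-K, and threshold are placeholders you would have to tune, and the TF-IDF matrix stands in for the pre-existing index.

# Sketch: decide whether to rerun LDA by checking how similar newly added
# documents are to the pre-existing collection (all parameters illustrative).
import random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def should_rerun_lda(old_docs, new_docs, sample_size=50, top_k=10, threshold=0.3):
    if not new_docs:
        return False
    # Vectorize the old collection once; this stands in for the index built
    # from the collection as it existed before the new documents were added.
    vectorizer = TfidfVectorizer(stop_words="english")
    old_matrix = vectorizer.fit_transform(old_docs)

    sims = []
    for doc in random.sample(new_docs, min(sample_size, len(new_docs))):
        query_vec = vectorizer.transform([doc])       # crude stand-in for keyword extraction
        scores = cosine_similarity(query_vec, old_matrix).ravel()
        top = sorted(scores, reverse=True)[:top_k]    # top-K most similar old documents
        sims.append(sum(top) / len(top))

    avg_sim = sum(sims) / len(sims)
    return avg_sim < threshold                        # low similarity -> rerun the full LDA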

Related

Efficiently Query Solr 9 for Similarity Score with Filter Queries

I'm using Solr 9 to compute query-document similarity. I have a use-case where I have to query for specific field values first, and then compute document similarities on all of the documents that are found.
My problem is as follows:
If each document has a field "embedding" and "id", I want to only retrieve documents with id=1,2,3, and given a query embedding, return the similarity score of each document with the query embedding.
Option 1: Filter on the IDs using fq, and use knn in the q parameter. Not all of the documents I want are returned, because of the limitation below.
The main issue with this is documented here:
When using knn in re-ranking pay attention to the topK parameter.
The second pass score(deriving from knn) is calculated only if the document d from the first pass is within the k-nearest neighbors(in the whole index) of the target vector to search.
This means the second pass knn is executed on the whole index anyway, which is a current limitation.
Option 2: Filter on the IDs using fq, include the embedding in the field list, and compute the similarities in memory. The issue with this is network latency, since the response from Solr is large when the embeddings are retrieved.
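A rough sketch of what Option 2 amounts to client-side (assuming the requests library; the core URL and field names are placeholders):

# Sketch of Option 2: fetch only id + embedding for the filtered documents and
# compute cosine similarity client-side (URL, core, and field names are assumptions).
import requests
import numpy as np

SOLR_URL = "http://localhost:8983/solr/my_core"   # hypothetical core

def similarities(query_embedding, ids):
    params = {
        "q": "*:*",
        "fq": "id:(" + " OR ".join(ids) + ")",    # restrict to the wanted ids
        "fl": "id,embedding",                      # keep the response as small as possible
        "rows": len(ids),
    }
    docs = requests.get(f"{SOLR_URL}/select", params=params).json()["response"]["docs"]

    q = np.asarray(query_embedding, dtype=float)
    q = q / np.linalg.norm(q)
    out = {}
    for d in docs:
        v = np.asarray(d["embedding"], dtype=float)
        out[d["id"]] = float(np.dot(q, v / np.linalg.norm(v)))   # cosine similarity
    return out

On the compression side, a client like this will transparently decompress a gzip-encoded response if your Solr/Jetty setup is configured to send one, but that only reduces bytes on the wire, not the work Solr does to serialize the vectors.
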
That leaves the following two questions:
When will the limitation described in the documentation above be resolved, if at all?
Is there a way to compress the response from Solr such that I can retrieve the response faster?
Thanks!

How would I order my collection by timestamp and score

I have a collection of documents that each have a createdAt timestamp and a score number. I sort all the documents by score for our leaderboard, but now I also want a daily best.
matchResults.orderBy("score").where("createdAt", ">", yesterday).startAt(someValue).limit(10);
But I found that there are limitations when using different fields.
https://firebase.google.com/docs/firestore/query-data/order-limit-data#limitations.
So how could I get today's results in chunks of 10, sorted by score?
You can use multiple orderBy(...) clauses to order on multiple fields, but this won't exactly meet your needs since you must first order by timestamp and only second by score.
A brute force option would be to fetch all the scores for the given day and truncate the list locally. But that of course won't work well if there are thousands of scores to load.
One simple answer would be to use a datestamp instead of timestamp:
matchResults.where("dayCreated", "=", "YYYY-MM-DD").orderBy("score").startAt(...).limit(10)
A second simple answer would be to run a Cloud Function on write events and maintain a daily top scores table separate from your scores data. If the results are frequently viewed, this would ultimately prove more economical and scalable as you would only need to record a small subset (say the top 100) by day, and can simply query that table ordering by score.
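As a rough sketch of that write-time aggregation (shown here with the Python Admin SDK rather than an actual Cloud Function, with made-up collection and field names, and a top-100 cutoff as an example):

# Sketch: maintain a small per-day "top scores" document on every score write
# (firebase_admin; collection and field names are illustrative).
import firebase_admin
from firebase_admin import firestore

firebase_admin.initialize_app()
db = firestore.client()

def record_score(user_id, score, day):                 # day like "2024-05-17"
    top_ref = db.collection("dailyTopScores").document(day)

    @firestore.transactional
    def update(tx):
        snap = top_ref.get(transaction=tx)
        entries = snap.to_dict().get("entries", []) if snap.exists else []
        entries.append({"user": user_id, "score": score})
        entries = sorted(entries, key=lambda e: e["score"], reverse=True)[:100]
        tx.set(top_ref, {"entries": entries})           # only the top 100 are kept per day

    update(db.transaction())
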
Scoreboards are extremely difficult at scale, so don't underestimate the complexity of handling every edge case. Start small and practical, focus on aggregating your results during writes and keep reads small and simple. Limit scope by listing only a top percentage for your "top scores" records and skip complex pagination schemas where possible.

Get statistics from firebase documents

I'm creating a React Firebase website with a collection of documents that each contain a rating from 1 to 10. All of these documents have an author attached. The average rating of all of the author's documents should be calculated and presented.
Here are my current two solutions:
Calculate the average from all the documents with the same author
Store the statistic on the author, so that every time the author adds a new document the statistic is updated
My thinking behind the second option is that the website doesn't have to calculate the average rating each time it is requested. Would this be a bad idea, or is there actually no problem with reading all the documents and calculating the average on every request?
Your second approach is in fact a best practice when working with NoSQL databases. If you calculate the average on demand across a dynamic number of documents, the cost of that operation will grow as you add more documents to the database.
For this reason you'll want to calculate all aggregates on write and store them in the database. With that approach, looking up an aggregate value is a single, cheap read.
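For example, a minimal sketch of maintaining a running average on the author document with the Python Admin SDK (collection and field names are made up):

# Sketch: update the author's rating aggregate in the same transaction that
# writes the new rated document (firebase_admin; names are illustrative).
import firebase_admin
from firebase_admin import firestore

firebase_admin.initialize_app()
db = firestore.client()

def add_rated_document(author_id, doc_data):            # doc_data contains a "rating" of 1-10
    author_ref = db.collection("authors").document(author_id)
    doc_ref = db.collection("documents").document()

    @firestore.transactional
    def update(tx):
        snap = author_ref.get(transaction=tx)
        stats = snap.to_dict() if snap.exists else {}
        count = stats.get("ratingCount", 0) + 1
        total = stats.get("ratingSum", 0) + doc_data["rating"]
        tx.set(doc_ref, doc_data)
        tx.set(author_ref, {"ratingCount": count, "ratingSum": total,
                            "avgRating": total / count}, merge=True)

    update(db.transaction())

Reading the average is then a single document read on the author.
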
Also see:
The Firebase documentation on aggregation queries
The Firebase documentation on distributed counters
How to get a count of number of documents in a collection with Cloud Firestore
Leaderboard ranking with Firebase

Rank multiple queries against one document based on relevance

Given a list of queries and given one document, I want to rank the queries based on how relevant they are to the given document.
For each query, I calculated the term frequency of each word in the query.
(term frequency defined as the number of times the word occurs in the document divided by the total number of words in the document)
Now, I summed up the term frequencies for each term in the query.
For example:
search query: "Hello World"
document: "It is a beautiful world"
tf for 'Hello': 0
tf for 'World': 1/5 = 0.2
total tf for query 'Hello World' = 0 + 0.2 = 0.2
My question is: what is the best way to normalize my term frequency for each query, so that a long query doesn't result in a larger relevance score?
And, is there a better way for me to score the query than just using the tf score?
I can't use tf-idf in my scenario because I am ranking them against just one document.
Before answering your question, I want to correct you on your definition of term frequency. The quantity you defined is actually the maximum-likelihood estimate of a term's probability in the document.
So, I am interpreting your first question as follows.
What is the best way to normalize final score (summation of maximum likelihood) for each query?
One simple approach is to divide the score by the query length so that a longer query doesn't receive a higher score. More advanced length-normalization techniques are also used when computing relevance scores in search engines.
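As a tiny sketch of that normalization in plain Python, using the numbers from your example (function and variable names are just illustrative):

# Sketch: length-normalized term-frequency score of a query against one document.
def query_score(query, document):
    doc_terms = document.lower().split()
    query_terms = query.lower().split()
    tf = lambda term: doc_terms.count(term) / len(doc_terms)  # your "term frequency"
    total = sum(tf(t) for t in query_terms)
    return total / len(query_terms)                           # divide by the query length

print(query_score("Hello World", "It is a beautiful world"))  # 0.2 / 2 = 0.1
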
Is there a better way for me to score the query than just using the tf score?
Yes, of course! One well-known and widely used ranking method, Okapi BM25, can be used here with little modification. You can think of your target task as a ranking problem:
given a document, rank a set of queries based on their relevance to that document.
This is a well-known problem in the context of search engines. I encourage you to follow lectures from any university's information retrieval course. For example, this lecture slide discusses the probabilistic ranking principle, which aligns with your need.
Coming to your remark on not being able to use idf, 'I can't use tf-idf in my scenario because I am ranking them against just one document.', here's what you could do:
Keep in mind that your ranking (retrievable) units are queries. Hence, consider that there's a reversal of roles between documents and queries with reference to the standard terminology.
In other words, treat your queries as pseudo-documents and your document as the pseudo-query.
You can then apply a whole range of ranking models that make use of the collection statistics (being computed over the set of queries), e.g. language model, BM25, DFR etc.
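One way to realize this role reversal, assuming the third-party rank_bm25 package (the queries are indexed as the BM25 "corpus" and the document supplies the query terms; the data below is made up):

# Sketch: rank queries against a single document by indexing the queries as a
# BM25 corpus and scoring them with the document's terms (rank_bm25 assumed).
from rank_bm25 import BM25Okapi

queries = ["hello world", "beautiful scenery", "world peace"]
document = "it is a beautiful world"

tokenized_queries = [q.split() for q in queries]   # queries as pseudo-documents
bm25 = BM25Okapi(tokenized_queries)                # collection statistics come from the queries

scores = bm25.get_scores(document.split())         # the document acts as the query
ranking = sorted(zip(queries, scores), key=lambda x: x[1], reverse=True)
print(ranking)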

Categorical Clustering of Users Reading Habits

I have a data set with a set of users and a history of documents they have read, all the documents have metadata attributes (think topic, country, author) associated with them.
I want to cluster the users based on their reading history per one of the metadata attributes associated with the documents they have clicked on. This attribute has 7 possible categorical values and I want to prove a hypothesis that there is a pattern to the users' reading habits and they can be divided into seven clusters. In other words, that users will often read documents based on one of the 7 possible values in the particular metadata category.
Anyone have any advice on how to do this especially in R, like specific packages? I realize that the standard k-means algorithm won't work well in this case since the data is categorical and not numeric.
Cluster analysis cannot be used to prove anything.
The results are highly sensitive to normalization, feature selection, and the choice of distance metric, so no single result is trustworthy, and many of the results you get are outright useless. In that sense it is about as reliable as a proof by example.
Cluster analysis should only be used for exploratory analysis, i.e. to find patterns that you then need to validate with other methods.
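If you do want to explore this, one simple route is to represent each user by the proportion of their reads falling into each of the 7 category values, which turns the categorical histories into numeric vectors you can cluster directly. A sketch in Python follows (the question asks about R, where kmodes from the klaR package or pam on a Gower distance from cluster::daisy would be the analogous tools; all data and names below are illustrative):

# Exploratory sketch only: cluster users on the proportion of their reads that
# fall in each category value (pandas + scikit-learn; data is illustrative).
import pandas as pd
from sklearn.cluster import KMeans

# One row per click, with the user and the document's categorical attribute
reads = pd.DataFrame({
    "user":     ["u1", "u1", "u2", "u2", "u3"],
    "category": ["sports", "sports", "politics", "sports", "politics"],
})

# Per-user distribution over the category values -> one numeric vector per user
profiles = pd.crosstab(reads["user"], reads["category"], normalize="index")

# Use n_clusters=7 with the real data; the toy example only has 3 users
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(profiles)
print(dict(zip(profiles.index, labels)))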
