Rank multiple queries against one document based on relevance - information-retrieval

Given a list of queries and a single document, I want to rank the queries based on how relevant they are to that document.
For each query, I calculated the term frequency of each word in the query.
(term frequency defined as the number of times the word occurs in the document divided by the total number of words in the document)
Now, I summed up the term frequencies for each term in the query.
For example:
search query: "Hello World"
document: "It is a beautiful world"
tf for 'Hello': 0
tf for 'World': 1/5 = 0.2
total tf for query 'Hello World' = 0 + 0.2 = 0.2
My question is: what is the best way to normalize my term frequency score for each query, so that a longer query doesn't automatically receive a larger relevance score?
And, is there a better way for me to score the query than just using the tf score?
I can't use tf-idf in my scenario because I am ranking them against just one document.

Before answering your question, I want to correct you on your definition of term frequency. What you defined is actually the maximum likelihood estimate of the term's probability in the document (under a unigram language model), not the raw term frequency.
So, I am interpreting your first question as follows.
What is the best way to normalize the final score (the sum of the maximum likelihood estimates) for each query?
One simple approach is to divide the score by the query length, so that a longer query doesn't receive a higher score. More advanced techniques are also used for computing relevance scores in the context of search engines.
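For instance, here is a minimal sketch of that normalization (assuming naive whitespace tokenization and lowercasing; the function name is illustrative):

def score_query(query, document):
    # Naive tokenization; a real system would also stem and strip punctuation.
    doc_terms = document.lower().split()
    query_terms = query.lower().split()
    # Sum of maximum-likelihood term frequencies, as in the question.
    total_tf = sum(doc_terms.count(t) / len(doc_terms) for t in query_terms)
    # Divide by query length so longer queries are not favored.
    return total_tf / len(query_terms)

# "Hello World" against "It is a beautiful world": (0 + 0.2) / 2 = 0.1
print(score_query("Hello World", "It is a beautiful world"))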
Is there a better way for me to score the query than just using the tf score?
Yes, of course! One well-known and widely used ranking method, Okapi BM25, can be used here with a little modification. You can think of your target task as a ranking problem:
given a document, rank a set of queries based on their relevance to the document.
This is a well-known problem in the context of search engines. I encourage you to follow lectures from an information retrieval course at any university; look in particular for material on the probabilistic ranking principle, which aligns with your need.

Coming to your remark on not being able to use idf ('I can't use tf-idf in my scenario because I am ranking them against just one document'), here's what you could do:
Keep in mind that your ranking (retrievable) units are queries. Hence, consider that there's a reversal of roles between documents and queries with reference to the standard terminology.
In other words, treat your queries as pseudo-documents and your document as the pseudo-query.
You can then apply a whole range of ranking models that make use of collection statistics (computed over the set of queries), e.g. language models, BM25, DFR, etc.
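As a minimal sketch of that role reversal (naive tokenization, standard BM25 parameters k1=1.5 and b=0.75; all names are illustrative): the collection statistics, including IDF, are computed over the set of queries, and the document's terms play the role of the query terms.

import math

def bm25_rank(queries, document, k1=1.5, b=0.75):
    # Queries are the retrievable units (pseudo-documents);
    # the single document supplies the "query" terms.
    docs = [q.lower().split() for q in queries]
    terms = set(document.lower().split())
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency computed over the set of queries.
    df = {t: sum(t in d for d in docs) for t in terms}
    ranked = []
    for q, d in zip(queries, docs):
        score = 0.0
        for t in terms:
            tf = d.count(t)
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(d) / avgdl))
        ranked.append((score, q))
    return sorted(ranked, reverse=True)

print(bm25_rank(["Hello World", "beautiful world", "foo bar"],
                "It is a beautiful world"))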

Related

Efficiently Query Solr 9 for Similarity Score with Filter Queries

I'm using Solr 9 for query-document similarity calculations. I have a use case where I have to query for specific field values first, and then compute document similarities on all of the documents that are found.
My problem is as follows:
If each document has an "embedding" field and an "id" field, I want to retrieve only the documents with id=1,2,3 and, given a query embedding, return each document's similarity score with that embedding.
Option 1: Query for the ids using fq, and use knn in the q parameter. Not all documents that I want will be returned, because of the limitation below.
The main issue with this is documented here:
When using knn in re-ranking, pay attention to the topK parameter.
The second-pass score (deriving from knn) is calculated only if the document d from the first pass is within the k-nearest neighbors (in the whole index) of the target vector to search.
This means the second-pass knn is executed on the whole index anyway, which is a current limitation.
Option 2: Query for the ids using fq, get the embedding in the field list, and compute the similarities in memory. The issue with that is network latency, since the Solr response is large when it includes the embeddings.
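For what it's worth, here is a sketch of Option 2 (assuming the standard /select handler with a JSON response and that the embedding field comes back as a list of floats; the core name, field names, and vector size are illustrative):

import numpy as np
import requests

resp = requests.get(
    "http://localhost:8983/solr/mycore/select",
    params={"q": "*:*", "fq": "id:(1 2 3)", "fl": "id,embedding", "rows": 3},
).json()

query_emb = np.random.rand(768)  # stand-in for the real query embedding

for doc in resp["response"]["docs"]:
    vec = np.array(doc["embedding"], dtype=float)
    # Cosine similarity computed client-side instead of via knn.
    sim = float(vec @ query_emb / (np.linalg.norm(vec) * np.linalg.norm(query_emb)))
    print(doc["id"], sim)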
That leaves the following two questions:
When will the limitation in the documentation above be solved, if at all?
Is there a way to compress the response from Solr such that I can retrieve the response faster?
Thanks!

How would I order my collection on timestamp and score

I have a collection with documents that have a createdAt timestamp and a score number. I sort all the documents on score for our leaderboard. But now I want to also have the daily best.
matchResults.orderBy("score").where("createdAt", ">", yesterday).startAt(someValue).limit(10);
But I found that there are limitations when using different fields.
https://firebase.google.com/docs/firestore/query-data/order-limit-data#limitations.
So how could I get today's results in chunks of 10, sorted on score?
You can use multiple orderBy(...) clauses to order on multiple fields, but this won't exactly meet your needs since you must first order by timestamp and only second by score.
A brute force option would be to fetch all the scores for the given day and truncate the list locally. But that of course won't work well if there are thousands of scores to load.
One simple answer would be to use a datestamp instead of timestamp:
matchResults.where("dayCreated", "==", "YYYY-MM-DD").orderBy("score").startAt(...).limit(10)
A second simple answer would be to run a Cloud Function on write events and maintain a daily top scores table separate from your scores data. If the results are frequently viewed, this would ultimately prove more economical and scalable as you would only need to record a small subset (say the top 100) by day, and can simply query that table ordering by score.
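A sketch of that write-time aggregation with the Firebase Admin SDK for Python (collection and field names are illustrative, and in practice this logic would live inside the write-triggered Cloud Function; a transaction would be needed to make it safe under concurrent writes):

from datetime import date
import firebase_admin
from firebase_admin import firestore

firebase_admin.initialize_app()
db = firestore.client()

def record_score(player_id, score):
    # Maintain a small per-day leaderboard document alongside the raw scores.
    day_ref = db.collection("dailyTopScores").document(date.today().isoformat())
    snap = day_ref.get()
    top = snap.to_dict().get("top", []) if snap.exists else []
    top.append({"player": player_id, "score": score})
    # Keep only the best 100 entries for the day.
    top = sorted(top, key=lambda e: e["score"], reverse=True)[:100]
    day_ref.set({"top": top})

Reads then become a single document fetch per day, already ordered by score at write time.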
Scoreboards are extremely difficult at scale, so don't underestimate the complexity of handling every edge case. Start small and practical: focus on aggregating your results during writes, and keep reads small and simple. Limit scope by listing only a top percentage for your "top scores" records, and skip complex pagination schemes where possible.

What happens when a top-k query does not find enough documents to satisfy the k constraint?

I am evaluating a top-k range query using NDCG. Given a spatial area and a query keyword, my top-k range query must return k documents in the given area that are textually relevant to the query keyword.
In my scenario, the range query usually finds only one document to return. But I have to compare this query against another one that can find more objects in the given area with the same keyword. This is possible because of an approach I am testing that improves the objects' descriptions.
I cannot figure out how to use NDCG to compare these two queries in this scenario. I would like to compare Query A and Query B using NDCG@5 and NDCG@10, but Query A only finds one object. Query A will have a high NDCG value precisely because of its lower ability to find more objects (the value will probably be one, the maximum). Query B finds more objects (in my opinion, a better result) but has a lower NDCG value than Query A.
You could consider looking at a different measure, e.g. Recall@10, if you care less about ranking for your application.
NDCG is a measure designed for web search, where you really want to penalize a system that doesn't return the best item at the topmost result, which is why it has an exponential decay factor. This makes sense for navigational queries like "stackoverflow": you will look quite bad if you don't return that website first.
It sounds like you are building something a little more sophisticated, where the user cares about many results. Therefore, a more recall-oriented measure (that cares about getting multiple things right more than the ranking) may make more sense.
Regarding "its lower ability to find more objects":
I'd also double-check your implementation of NDCG: you always want to divide by the DCG of the ideal ranking, regardless of what actually gets returned. It sounds like your Query A returns 1 correct object, while Query B returns more correct objects, but not at high ranks? Either way, Query A's DCG should be divided by the DCG of a perfect ranking, which may contain 10, 20, or thousands of "correct" objects. It may be that you just don't have enough judgments, so your "perfect ranking" is too small and you aren't penalizing Query A enough.
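A minimal NDCG sketch that illustrates the point (binary relevance; note that the ideal DCG is built from all relevant items, not just the ones the system returned):

import math

def ndcg_at_k(ranked_rels, total_relevant, k=10):
    # DCG of the actual ranking, with binary gains.
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ranked_rels[:k]))
    # Ideal DCG: all relevant items at the top, regardless of what was returned.
    ideal = sum(1 / math.log2(i + 2) for i in range(min(total_relevant, k)))
    return dcg / ideal if ideal > 0 else 0.0

# Query A: one correct result, but 10 objects are actually relevant.
print(ndcg_at_k([1], total_relevant=10))                             # ~0.22, not 1.0
# Query B: five correct results at mixed ranks, same 10 relevant objects.
print(ndcg_at_k([1, 0, 1, 1, 0, 1, 0, 1, 0, 0], total_relevant=10))  # ~0.58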

How often should I execute LDA for the whole document corpus?

Let's assume that we have a moderately growing document corpus, i.e. some new documents get added to it every day. For these newly added documents, I can infer the topic distributions just by using the inference part of LDA; I do not have to execute the whole topic estimation + inference process again for all documents just to get the topic distributions of the new ones. However, over time I might need to run the whole topic generation process again, as the documents added since the last LDA execution might introduce totally new words to the corpus.
Now, the question that I have is: how do I determine a good enough interval between two topic generation executions? Are there any general recommendations on how often we should execute LDA on the whole document corpus?
If I keep this interval very short, I might lose stable topic distributions; the distributions will keep changing. If I keep the interval too long, I might miss new topics and new topic structures.
I'm just thinking aloud here... One very simple idea is to sample a subset of documents from the batch of newly added documents (say, over a period of one day).
You could possibly extract key words from each of these documents in the sampled set, and execute each as a query to the index built from the version of the collection that existed before adding these new documents.
You could then measure the average cosine similarity of the top K documents retrieved in response to each query (averaged over all queries in the sampled set). If this average similarity is below a predefined threshold, it might indicate that the new documents are not very similar to the existing ones, and it might thus be a good idea to rerun LDA on the whole collection.
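Here is a sketch of that heuristic with scikit-learn, using a TF-IDF matrix over the pre-existing collection as a stand-in for the index (the threshold, sample, and names are all illustrative):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def should_rerun_lda(old_docs, new_docs_sample, top_k=10, threshold=0.2):
    # "Index" built from the collection as it was before the new documents.
    vectorizer = TfidfVectorizer()
    old_matrix = vectorizer.fit_transform(old_docs)
    sims = []
    for doc in new_docs_sample:
        # Use the new document itself as the query against the old index.
        q = vectorizer.transform([doc])
        # Dot products equal cosine similarities: TF-IDF rows are L2-normalized.
        scores = (old_matrix @ q.T).toarray().ravel()
        sims.append(np.sort(scores)[::-1][:top_k].mean())
    # Low average similarity suggests the new material diverges from old topics.
    return float(np.mean(sims)) < threshold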

Is there a way to get the list of indexed words from Marklogic universal index

I am working with the MarkLogic tool and have a database of around 27,000 documents.
What I want to do is retrieve the keywords which have maximum frequency in the documents given by the result of any search query.
I am currently using xquery functions to count the frequency of each word in the set of all documents retrieved as query result. However, this is quite inefficient.
I was thinking it would help if I could get the list of words on which MarkLogic has performed indexing.
So, is there a way to retrieve the list of indexed words from MarkLogic's universal index?
Normally you would use something like this in MarkLogic:
(: Top 10 most frequent values of the element "myelem" :)
(
  for $v in cts:element-values(xs:QName("myelem"))
  let $f := cts:frequency($v)
  order by $f descending
  return $v
)[1 to 10]
This kind of functionality is built into the search:search library, which works very conveniently.
But unfortunately you cannot use that on values from cts:words and the like. There is a little trick that could get you close, though: instead of using cts:frequency, you can use xdmp:estimate on a cts:search to get a fragment count:
(: Top 10 words by estimated fragment count across the collection :)
(
  for $v in cts:words()
  let $f := xdmp:estimate(cts:search(collection(), $v))
  order by $f descending
  return $v
)[1 to 10]
The performance is lower, but still much faster than bluntly running through all the documents.
HTH!
What if your search contains multiple terms? How will you calculate the order?
What if some of your terms are very common in your corpus of documents, and others are very rare? Should the count of "the" contribute more to the score than "protease", or should they contribute the same?
If the words occur in the title vs elsewhere in the document, should that matter?
What if one document is relatively short, and another is quite long. How do you account for that?
These are some of the basic questions that come up when trying to determine relevancy. Most search engines use a combination of term frequency (how often do the terms occur in your documents), and document frequency (how many documents contain the terms). They can also use the location of the terms in your documents to determine a score, and they can also account for document length in determining a score.
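As a rough illustration of how those two signals combine (this is the generic log tf-idf idea, not MarkLogic's exact formula; the numbers are made up around the 27,000-document corpus from the question):

import math

def log_tfidf(term_count, docs_with_term, total_docs):
    # Dampen the raw term count and weight it by rarity across the corpus,
    # so "the" contributes far less than "protease".
    tf = math.log(1 + term_count)
    idf = math.log(total_docs / (1 + docs_with_term))
    return tf * idf

# "the": 50 occurrences in the doc, present in nearly every document.
print(log_tfidf(50, 26000, 27000))   # ~0.15
# "protease": 2 occurrences, present in only 12 of 27,000 documents.
print(log_tfidf(2, 12, 27000))       # ~8.4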
MarkLogic uses a combination of term frequency and document frequency to determine relevance by default. These factors (and others) are used to determine a relevance score for your search criteria, and this score is the default sorting for results returned by search:search from the search API or the low-level cts:search and its supporting operators.
You can look at the details of the options for cts:search to learn about some of the different scoring options. See 'score-logtfidf' and others here:
http://community.marklogic.com/pubs/5.0/apidocs/SearchBuiltins.html#cts:search
I would also look at the search developers guide:
http://community.marklogic.com/pubs/5.0/books/search-dev-guide.pdf
Many of the concepts are under consideration by the XQuery working group as enhancements for a future version of XQuery. They aren't part of the language today. MarkLogic has been at the forefront of search for a number of years, so you'll find there are many features in the product, and a lot of discussion related to this area in the archives.
"Is there a way to retrieve the list of indexed words from the universal index of marklogic?" No. The universal index is a hash index, so it contains hashes not words.
As noted by others you can create value-based lexicons that can list their contents. Some of these also include frequency information. However, I have another suggestion: cts:distinctive-terms() will identify the most distinctive terms from a sequence of nodes, which could be the current page of search results. You can control whether the output terms are just words, or include more complex terms such as element-word or phrase. See the docs for more details.
http://docs.marklogic.com/5.0doc/docapp.xqy#display.xqy?fname=http://pubs/5.0doc/apidoc/SearchBuiltins.xml&category=SearchBuiltins&function=cts:distinctive-terms
I have used cts:distinctive-terms(). It gives mostly wildcarded terms in my case, which are not of much use. Furthermore, it is suitable for finding distinctive terms in a single document; when I try to run it on many documents it is quite slow.
What I want to implement is a dynamic facet populated with the keywords of the documents that come up in the search result. I have implemented it, but it is inefficient, as it counts the frequency of all the words in the documents. I want it to be a suggestion or recommendation feature: if you have searched for this particular term or phrase, then you may be interested in these suggested terms or phrases. So I want an efficient method to find the terms that are common in the result set of documents of a search.
I tried cts:words() as suggested. It gives words similar to the search query word and the number of documents in which each is contained. What it does not take into account is the set of search-result documents: it just shows the number of documents that contain similar words in the whole database, irrespective of whether those documents are present in the search result or not.
