Efficiently Query Solr 9 for Similarity Score with Filter Queries

I'm using Solr 9 to compute query-document similarity. I have a use case where I have to query for specific field values first, and then compute similarity scores for all of the documents that are found.
My problem is as follows:
If each document has an "embedding" field and an "id" field, I want to retrieve only the documents with id=1,2,3 and, given a query embedding, return the similarity score of each of those documents against the query embedding.
Option 1: Query for the ids using fq, and use the knn query parser in q. Not all of the documents I want are returned, because of the limitation below.
The main issue with this is documented here:
When using knn in re-ranking pay attention to the topK parameter.
The second pass score (deriving from knn) is calculated only if the document d from the first pass is within the k-nearest neighbors (in the whole index) of the target vector to search.
This means the second pass knn is executed on the whole index anyway, which is a current limitation.
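A minimal sketch of Option 1, assuming a local Solr 9 instance and a collection named my_collection (both hypothetical); "embedding" and "id" are the field names from the question, and topK is the knn query parser parameter:

import requests

# Hypothetical endpoint and collection; adjust to your deployment.
SOLR_URL = "http://localhost:8983/solr/my_collection/select"

query_vector = [0.12, 0.43, 0.88, 0.17]  # example query embedding

params = {
    # knn query parser: scores documents by similarity to the query embedding
    "q": "{!knn f=embedding topK=10}" + str(query_vector),
    # restrict to the documents of interest; subject to the topK limitation above
    "fq": "id:(1 2 3)",
    "fl": "id,score",
}

response = requests.get(SOLR_URL, params=params)
response.raise_for_status()
for doc in response.json()["response"]["docs"]:
    print(doc["id"], doc["score"])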
Option 2: Query for the ids using fq, include the embedding field in the field list (fl), and compute the similarities in memory. The issue with that is network latency: the response from Solr is large when the embeddings are retrieved.
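A sketch of Option 2 under the same assumptions about the instance, collection, and field names; cosine similarity is used here just as an example, substitute whatever similarity function matches your embeddings:

import numpy as np
import requests

# Hypothetical endpoint and collection; adjust to your deployment.
SOLR_URL = "http://localhost:8983/solr/my_collection/select"

query_vector = np.array([0.12, 0.43, 0.88, 0.17])

params = {
    "q": "*:*",
    "fq": "id:(1 2 3)",
    "fl": "id,embedding",  # returning the stored vectors is what makes the response large
    "rows": 3,
}

docs = requests.get(SOLR_URL, params=params).json()["response"]["docs"]

for doc in docs:
    vec = np.array(doc["embedding"])
    # cosine similarity between the query embedding and the document embedding
    sim = float(np.dot(query_vector, vec) / (np.linalg.norm(query_vector) * np.linalg.norm(vec)))
    print(doc["id"], sim)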
That leaves the following two questions:
When will the limitation described in the documentation above be resolved, if at all?
Is there a way to compress the response from Solr such that I can retrieve the response faster?
Thanks!

Related

Vector based search in solr

I am trying to implement dense vector based search in Solr (currently using version 8.5.2). My requirement is to store a dense vector representation for each document in Solr in a field called vectorForm.
When a user issues a query, I convert that query to a dense vector representation as well, and I want to get the top 100 documents from Solr that have the highest dotProduct value between the query vector and the vectorForm field stored for each document.
A few questions that I had around this are
What field type should be used to define the vectorForm field (does docValues with multiValued integers work best here)?
How do I efficiently do the above vector based retrieval? (keeping in mind that latency should be as low as possible)
I read that Solr has dotProduct and cosineSimilarity functions, but I am not able to understand how to use them in my case. If that's the solution, any link to an example implementation would help.
Any help or guidance will be a huge help for me.
You can use "dense vector search" starting with Solr 9.0.
https://solr.apache.org/guide/solr/9_0/query-guide/dense-vector-search.html
Neural Search has been released with Apache Solr 9.0.
The DenseVectorField gives the possibility of indexing and searching dense vectors of float elements, defining parameters such as the dimension of the dense vector to pass in, the similarity function to use, the knn algorithm to use, etc...
Currently, it is still necessary to produce the vectors externally and then push the obtained embeddings into Solr.
At query time you can use the k-nearest neighbors (knn) query parser that allows finding the k-nearest documents to the query vector according to indexed dense vectors in the given field.
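As a rough sketch (not taken from the answer above): assuming a Solr 9 collection named my_collection whose schema defines vectorForm as a solr.DenseVectorField (for example with vectorDimension="4" and similarityFunction="dot_product"), indexing and querying could look roughly like this, with the vectors produced by an external model:

import requests

SOLR_URL = "http://localhost:8983/solr/my_collection"  # hypothetical collection

# index a document together with its externally produced embedding
requests.post(SOLR_URL + "/update?commit=true",
              json=[{"id": "42", "vectorForm": [0.11, 0.27, 0.63, 0.05]}])

# retrieve the top 100 documents closest to the query vector
query_vector = [0.09, 0.31, 0.58, 0.02]
params = {
    "q": "{!knn f=vectorForm topK=100}" + str(query_vector),
    "fl": "id,score",
}
print(requests.get(SOLR_URL + "/select", params=params).json()["response"]["docs"])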
Here is our End-to-End Vector Search Tutorial, which can help you understand how to leverage this new Solr feature to improve the user search experience:
https://sease.io/2023/01/apache-solr-neural-search-tutorial.html

Two-step search to find documents with similar vectors in Solr

I am thinking about finding documents in Solr that have similar vectors. The flow is:
1. A user enters a few keywords.
2. A list of documents that contain the keywords is returned by Solr, based on Solr's scoring algorithms.
3. The user then selects a couple of documents as the reference documents.
4. Solr then searches for documents that correlate closely (have similar vectors) with the selected documents.
For the first 3 steps, I know how to do it, but I have no clue how to perform step 4. I have read https://cwiki.apache.org/confluence/display/solr/The+Term+Vector+Component, but I am still not sure how to perform step 4.
I can think of two approaches. The first is to use search results clustering: you first search by the keywords, then ask Solr to cluster the results and present the user with the list of clusters and their documents.
The second approach is to use multiple requests to the More Like This handler and merge the results. In each request, you use one of the reference documents that the user has marked.
Step 4 sounds like a More Like This function, which already ships with Solr.
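A sketch of the More Like This approach described above, using the mlt query parser; the collection name, the field name "content", and the uniqueKey values are assumptions, and one request is issued per reference document before merging:

import requests

SOLR_URL = "http://localhost:8983/solr/my_collection/select"  # hypothetical collection

reference_ids = ["17", "42"]  # the documents the user marked as references in step 3
similar = {}

for ref_id in reference_ids:
    params = {
        # find documents whose "content" terms are similar to the reference document
        "q": "{!mlt qf=content mintf=1 mindf=1}" + ref_id,
        "fl": "id,score",
        "rows": 20,
    }
    for doc in requests.get(SOLR_URL, params=params).json()["response"]["docs"]:
        # merge by keeping the best score seen across the reference documents
        similar[doc["id"]] = max(similar.get(doc["id"], 0.0), doc["score"])

print(sorted(similar.items(), key=lambda kv: kv[1], reverse=True))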

Riak: how are queries using secondary indices implemented?

Consider a query that uses secondary indices. Does this cause the node that received the query to send out a request to all other nodes? That is, does the use of secondary indices require communicating with all other nodes to find data that matches the index lookup?
The best source for information on how querying of secondary indexes works can be found here:
http://docs.basho.com/riak/latest/dev/advanced/2i/
I believe that the portion of the explanation that is relevant to your question is:
"When issuing a query, the system must read from a “covering” set of partitions and then merge the results. The system looks at how many replicas of data are stored—the N value or n_val—and determines the minimum number of partitions that it must examine (1 / n_val) to retrieve a full set of results, also taking into account any offline nodes."
Also note that: "For all 2i queries, the R parameter is set to 1," - http://docs.basho.com/riak/latest/dev/using/2i/#Querying
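As a rough illustration (assuming Riak's default n_val of 3 and a ring of 64 partitions, which are defaults rather than values stated in the question), a covering query would read from roughly 64 / 3 ≈ 22 partitions, i.e. about a third of the ring, rather than from every partition or every node.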

Get all values of some parameter for all documents in Marklogic

I'm trying to get the 'xxx' element of all documents in MarkLogic using a query like:
(/doc/document)/xxx
But since we have a very big document database, I get the error "Expanded tree cache full on host". I don't have admin rights on this server, so I can't change the configuration. I assume I can use ranges while getting documents, like:
(/doc/document)[1 to 1000]/xxx
and then
(/doc/document)[1000 to 2000]/xxx
and so on, but I'm concerned that I do not know how this works. For example, what happens if the database changes during this process (e.g. a new document is added), and how will it affect the resulting document list? I also don't know which order is used when I query with ranges...
Please clarify whether this approach is appropriate, or whether there are other ways to get a parameter from all documents.
Depending on how big your database is, there may be no way to get all the values in one transaction.
Suppose you have a trillion documents: the result set will be bigger than can be returned in one transaction.
Is that important? Only your business case can tell.
The most efficient way of getting all "xxx" values is with a range index. You can see how this works
with cts:element-values ( https://docs.marklogic.com/cts:element-values )
You do need to be able to create a range index over the element "xxx" to do this (ask your DBA).
Then cts:element-values() returns only those values, and the chances of being able to return most or all of them
in memory in a single transaction are much higher than using XPath (/doc/document/xxx), which, as you wrote, actually returns all the "xxx" elements (not just their values). That most likely requires loading every document matching /doc/document, parsing it, and returning the xxx element, which can be both slow and inefficient.
A range index just stores the values and you can retrieve those without ever having to load the actual document.
In general when working with large datasets learning how to access data in MarkLogic using only indexes will produce the fastest results.

Is there a way to get the list of indexed words from Marklogic universal index

I am working with the MarkLogic tool.
I have a database of around 27000 documents.
What I want to do is retrieve the keywords that have the maximum frequency in the documents returned by a search query.
I am currently using XQuery functions to count the frequency of each word in the set of all documents retrieved as the query result. However, this is quite inefficient.
I was thinking that it would help me if I could get the list of words on which MarkLogic has performed indexing.
So is there a way to retrieve the list of indexed words from the universal index of MarkLogic?
Normally you would use something like this in MarkLogic:
(
(: list values from the range index on "myelem", most frequent first :)
for $v in cts:element-values(xs:QName("myelem"))
let $f := cts:frequency($v)
order by $f descending
return $v
)[1 to 10]
This kind of functionality is built into the search:search library, which works very conveniently.
But unfortunately you cannot use that on values from cts:words and the like. There is a little trick that could get you close, though: instead of using cts:frequency, you could use xdmp:estimate on a cts:search to get a fragment count:
(
for $v in cts:words()
let $f := xdmp:estimate(cts:search(collection(), $v))
order by $f descending
return $v
)[1 to 10]
The performance is lower, but still much faster than bluntly running through all documents.
HTH!
What if your search contains multiple terms? How will you calculate the order?
What if some of your terms are very common in your corpus of documents, and others are very rare? Should the count of "the" contribute more to the score than "protease", or should they contribute the same?
If the words occur in the title vs elsewhere in the document, should that matter?
What if one document is relatively short, and another is quite long. How do you account for that?
These are some of the basic questions that come up when trying to determine relevancy. Most search engines use a combination of term frequency (how often do the terms occur in your documents), and document frequency (how many documents contain the terms). They can also use the location of the terms in your documents to determine a score, and they can also account for document length in determining a score.
MarkLogic uses a combination of term frequency and document frequency to determine relevance by default. These factors (and others) are used to determine a relevance score for your search criteria, and this score is the default sorting for results returned by search:search from the search API or the low-level cts:search and its supporting operators.
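As a rough illustration of the idea (not MarkLogic's exact formula), a classic log tf-idf weight scores a term t in a document d as log(1 + tf(t, d)) * log(N / df(t)), where tf is how often t occurs in d, df is the number of documents containing t, and N is the total number of documents; a term that occurs often but only in a few documents contributes the most to the score.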
You can look at the details of the options for cts:search to learn about some of the different scoring options. See 'score-logtfidf' and others here:
http://community.marklogic.com/pubs/5.0/apidocs/SearchBuiltins.html#cts:search
I would also look at the search developers guide:
http://community.marklogic.com/pubs/5.0/books/search-dev-guide.pdf
Many of the concepts are under consideration by the XQuery working group as enhancements for a future version of XQuery. They aren't part of the language today. MarkLogic has been at the forefront of search for a number of years, so you'll find there are many features in the product, and a lot of discussion related to this area in the archives.
"Is there a way to retrieve the list of indexed words from the universal index of marklogic?" No. The universal index is a hash index, so it contains hashes not words.
As noted by others you can create value-based lexicons that can list their contents. Some of these also include frequency information. However, I have another suggestion: cts:distinctive-terms() will identify the most distinctive terms from a sequence of nodes, which could be the current page of search results. You can control whether the output terms are just words, or include more complex terms such as element-word or phrase. See the docs for more details.
http://docs.marklogic.com/5.0doc/docapp.xqy#display.xqy?fname=http://pubs/5.0doc/apidoc/SearchBuiltins.xml&category=SearchBuiltins&function=cts:distinctive-terms
I have used cts:distinctive-terms(). In my case it gives mostly wildcarded terms, which are not of much use. Further, it is suitable for finding distinctive terms in a single document; when I try to run it on many documents it is quite slow.
What I want to implement is a dynamic facet that is populated with the keywords of the documents that come up in the search result. I have implemented it, but it is inefficient because it counts the frequency of all the words in the documents. I want it to be a suggestion or recommendation feature: if you have searched for this particular term or phrase, then you may be interested in these suggested terms or phrases. So I want an efficient method to find the terms that are common in the result set of documents of a search.
I tried cts:words() as suggested. It gives words similar to the search query word and the number of documents in which each is contained. What it does not take into account is the set of search result documents: it just shows the number of documents that contain similar words in the whole database, irrespective of whether those documents are present in the search result or not.
