Vector-based search in Solr

I am trying to implement dense-vector-based search in Solr (currently using version 8.5.2). My requirement is
to store a dense vector representation for each document in Solr, in a field called vectorForm.
When a user issues a query, I convert that query into a dense vector representation as well, and I then want to get the top 100 documents from Solr that have the highest dotProduct value between the query vector and the vectorForm field stored for each document.
A few questions I have around this:
What field type should be used to define the vectorForm field (does docValues with multiValued integers work best here)?
How do I efficiently do the above vector based retrieval? (keeping in mind that latency should be as low as possible)
I read that Solr has dotProduct and cosineSimilarity functions, but I am not able to understand how to use them in my case. If that's the solution, any link to an example implementation would help.
Any help or guidance will be a huge help for me.

You can use "dense vector search" starting with Solr 9.0.
https://solr.apache.org/guide/solr/9_0/query-guide/dense-vector-search.html

Neural Search has been released with Apache Solr 9.0.
The DenseVectorField makes it possible to index and search dense vectors of float elements, with parameters such as the dimension of the vector to pass in, the similarity function to use, and the knn algorithm to use.
Currently, it is still necessary to produce the vectors externally and then push the obtained embeddings into Solr.
At query time you can use the k-nearest neighbors (knn) query parser that allows finding the k-nearest documents to the query vector according to indexed dense vectors in the given field.
Here is our End-to-End Vector Search Tutorial, which can definitely help you understand how to leverage this new Solr feature to improve the user search experience:
https://sease.io/2023/01/apache-solr-neural-search-tutorial.html
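For illustration, here is a minimal sketch of how this can look against Solr 9.x from Python, using the Schema API to declare a DenseVectorField and the knn query parser at search time. The collection name, field names, and the tiny 4-dimensional vectors are placeholders, not something prescribed by Solr:

import requests

SOLR = "http://localhost:8983/solr/mycollection"  # placeholder collection

# 1. Declare a dense vector field type and a field via the Schema API.
requests.post(f"{SOLR}/schema", json={
    "add-field-type": {
        "name": "knn_vector",
        "class": "solr.DenseVectorField",
        "vectorDimension": 4,                  # must match your embedding size
        "similarityFunction": "dot_product",   # dot_product assumes unit-length vectors; use "cosine" otherwise
    },
    "add-field": {"name": "vectorForm", "type": "knn_vector", "indexed": True, "stored": True},
})

# 2. Index documents with their externally produced embeddings.
requests.post(f"{SOLR}/update?commit=true", json=[
    {"id": "1", "vectorForm": [0.1, 0.2, 0.3, 0.4]},
    {"id": "2", "vectorForm": [0.4, 0.3, 0.2, 0.1]},
])

# 3. Retrieve the top 100 nearest documents to the query vector with the knn query parser.
resp = requests.get(f"{SOLR}/select", params={
    "q": "{!knn f=vectorForm topK=100}[0.1, 0.2, 0.3, 0.4]",
    "fl": "id,score",
})
print(resp.json()["response"]["docs"])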

Related

Efficiently Query Solr 9 for Similarity Score with Filter Queries

I'm using Solr 9 for optimal query-document similarity calculations. I have a use-case where I have to query for specific field values first, and then compute document similarities on all of the documents that are found.
My problem is as follows:
If each document has fields "embedding" and "id", I want to only retrieve documents with id=1,2,3, and, given a query embedding, return the similarity score of each document with the query embedding.
Option 1: Query for the ids using fq, and the q field using knn (a sketch of this request follows the two options). Not all documents that I want will be returned because of the limitation below.
The main issue with this is documented here:
When using knn in re-ranking pay attention to the topK parameter.
The second pass score (deriving from knn) is calculated only if the document d from the first pass is within the k-nearest neighbors (in the whole index) of the target vector to search.
This means the second pass knn is executed on the whole index anyway, which is a current limitation.
Option 2: Query for the ids using fq, get the embedding in the field list, and compute the similarities in memory. The issue with that is the network latency, since the size of the response from Solr is large when retrieving the embeddings.
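For reference, the request behind Option 1 might look roughly like the following (field names and the vector are illustrative); the topK caveat quoted above is the reason documents get lost:

import requests

SOLR = "http://localhost:8983/solr/mycollection"  # placeholder collection
query_vector = "[0.12, 0.43, 0.87, 0.21]"         # illustrative embedding

resp = requests.get(f"{SOLR}/select", params={
    # knn selects the topK nearest documents over the whole index...
    "q": f"{{!knn f=embedding topK=100}}{query_vector}",
    # ...and the filter is then applied to that candidate set (the behaviour
    # described above), so matching ids outside the global topK are missed.
    "fq": "id:(1 2 3)",
    "fl": "id,score",
})
print(resp.json()["response"]["docs"])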
That leaves the following two questions:
When will the limitation in the documentation above be resolved, if at all?
Is there a way to compress the response from Solr such that I can retrieve the response faster?
Thanks!

Does Weaviate support dot product similarity when using the Python SDK?

I have saved vectors in Weaviate that I want to query using dot product.
I'm using the Python SDK and I just don't see any way of specifying this.
Does anyone know if this is possible/not possible?
Hi and thanks for your question.
The simple answer as of writing this is "not yet, but soon", but I think I need to elaborate a bit.
Distance Functions
Generally, distance functions in Weaviate are entirely pluggable. Anything that can produce a score can be plugged in. For example, see this folder. In fact, you will even see a file named dot_product.go in there. This is because internally for calculating the cosine sim, Weaviate will normalize all vectors on read and then just calculate the dot product.
APIs
So, if Weaviate can calculate the dot product, why can't you select this option? This is because of a past decision to introduce the certainty field in the API. This field is used to return scores and to limit results by score. The original idea behind the certainty was that we would want a single metric that can produce a number between 0 and 1 to indicate the distance. With cosine sim that's simple, as it is already in the range [-1, 1], so it's very easy to transform it into a certainty. With an unbounded score such as dot product, this isn't so easy.
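To illustrate why a bounded cosine similarity maps cleanly onto a certainty while an unbounded dot product does not, here is a rough sketch of that kind of rescaling (the exact formula Weaviate applies internally may differ):

def certainty_from_cosine(cos_sim: float) -> float:
    # Cosine similarity is bounded in [-1, 1], so a simple affine rescale
    # yields a value in [0, 1] that can be reported as a "certainty".
    return (1.0 + cos_sim) / 2.0

# A raw dot product has no fixed bounds, so there is no analogous rescale:
# normalizing the vectors first would just turn it back into cosine similarity.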
Path forward
Here is a discussion on this topic. Feel free to participate in this discussion. The current favorite option is to deprecate certainty and expose the raw values as either score or distance.
Any quickfixes?
We could easily enable new distance scores, such as dot product, before the above-mentioned API issue is solved, possibly as an experimental feature behind a feature flag. However, you would not be able to see the resulting scores/distances in the APIs.
Timelines
I expect the above-mentioned issue to be resolved in a couple of weeks as of writing this (April 28, 2022).

How to query by nearest neighbor on Cloud Firestore?

I'm quite new on the Firebase ecosystem and I'm wondering if there's a way to use a smart querying system using N-dimensional features vector.
We're trying to deploy a face-recognition application which, after computing its encoding (a vector of 128 features, basically), tests it against the database (like Cloud Firestore) to find the closest match. From what I've understood, the same task is usually achieved using PostgreSQL, Apache Solr, etc., indexing the 128 fields and using a cube operator or a Euclidean-distance-like query, with quite reasonable timings. I think there's already something similar for geo-location queries (GeoFire).
Is there a way or some alternative options to perform this kind of task?
Firestore queries can only perform comparisons along a single axis. There is no built-in way for them to perform comparisons on multiple values, which is why GeoFire has to use geohash values to emulate multi-axis comparisons.
That is also your only option when it comes to other multi-value comparisons: package the values into a single value in a way that allows comparing those packed values in the way you need.
If your use-case also requires that you can compare two scalar values, and then get a range from within the resulting two-dimensional space, you can probably use a similar scheme of packing the two values into a string bit-by-bit. You might even be able to use GeoFire as a starting point, although you'll need to modify it to remove the assumption that GeoFire's points lie on a sphere.
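As a sketch of that bit-by-bit packing idea, here is a geohash-style interleaving of two scalar values into one string (the bounds and precision are arbitrary, and this does not extend usefully to a 128-dimensional face embedding):

def pack(x: float, y: float, lo: float = 0.0, hi: float = 1.0, bits: int = 16) -> str:
    # Quantize two scalars and interleave their bits so that lexicographic
    # order on the packed string roughly preserves 2-D locality and can be
    # range-queried on a single Firestore field.
    def quantize(v: float) -> int:
        v = min(max(v, lo), hi)
        return round((v - lo) / (hi - lo) * ((1 << bits) - 1))

    qx, qy = quantize(x), quantize(y)
    packed = 0
    for i in range(bits - 1, -1, -1):   # interleave from the most significant bit down
        packed = (packed << 2) | (((qx >> i) & 1) << 1) | ((qy >> i) & 1)
    # Fixed-width lowercase hex keeps string ordering consistent with numeric ordering.
    return format(packed, f"0{bits // 2}x")

# Example: store pack(x, y) in a single field and issue range queries on it.
print(pack(0.70, 0.30), pack(0.71, 0.31))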
If that doesn't work for you, I recommend looking at one of the solutions that more natively has support for your use-case.

OpenTSDB indexing on keys

As I've worked in my personal lab instance of OpenTSDB, I've started to wonder if it is possible to get it to index on tags as well as metric names. My understanding (correction is welcome...) is that OpenTSDB indexes only on metric names. So, suppose I have something like the following, borrowed from the docs:
tsd.hbase.rpcs{type=*,host=tsd1}
My understanding is that tsd.hbase.rpcs is indexed for searching, but that the keys (type=, host=, etc) are not. Is that correct? If so, is there a way to have them be indexed, or some reasonable approximation of it? Thanks.
Yes, you are correct. According to the documentation, OpenTSDB creates keys in the 'tsdb' HBase table of the form
[salt]<metric_uid><timestamp><tagk1><tagv1>[...<tagkN><tagvN>]
When you do a query with specific tagk and tagv OpenTSDB can construct the key and look it up. If you have a range of tagk and tagv it will look up all the rows and either aggregate them or return multiple time series, depending on your query.
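For illustration, here is a rough sketch of how such a row key is assembled with OpenTSDB 2.x defaults (3-byte UIDs, a 4-byte base timestamp aligned to the hour, no salt); the UID bytes are made up:

import struct

def tsdb_row_key(metric_uid: bytes, timestamp: int, tags: dict) -> bytes:
    # Default (unsalted) layout: 3-byte metric UID + 4-byte hour-aligned base
    # timestamp + the 3-byte tagk/tagv UID pairs, sorted by tagk UID.
    base_ts = timestamp - (timestamp % 3600)
    key = metric_uid + struct.pack(">I", base_ts)
    for tagk_uid, tagv_uid in sorted(tags.items()):
        key += tagk_uid + tagv_uid
    return key

# e.g. tsd.hbase.rpcs{type=put,host=tsd1} with made-up UIDs: when every tag is
# given, the full key can be constructed and looked up directly.
key = tsdb_row_key(b"\x00\x00\x01", 1700000000,
                   {b"\x00\x00\x02": b"\x00\x00\x07", b"\x00\x00\x03": b"\x00\x00\x09"})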
If you are interested in asking questions about tagks, you should use the OpenTSDB search/lookup API; however, this still requires a metric name.
If you want to formulate your question around tagks only, you could consider forwarding your data to Bosun for indexing and using its API
/api/metric/{tagk}/{tagv}
Returns the metrics that are available for the specified tagk/tagv pair. For example, you can see what metrics are available for host=server01.
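A sketch of what those two lookups could look like over HTTP (hosts, ports, and the metric/tag values are placeholders):

import requests

# OpenTSDB search/lookup: list the time series for a metric, optionally
# narrowed by tags. Note that a metric name is still required.
resp = requests.get("http://opentsdb:4242/api/search/lookup",
                    params={"m": "tsd.hbase.rpcs{host=tsd1}"})
print(resp.json())

# Bosun: list the metrics available for a given tagk/tagv pair, no metric needed.
resp = requests.get("http://bosun:8070/api/metric/host/server01")
print(resp.json())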

Is there a way to get the list of indexed words from Marklogic universal index

I am working with the MarkLogic tool.
I have a database of around 27,000 documents.
What I want to do is retrieve the keywords that have the maximum frequency in the documents returned by any given search query.
I am currently using XQuery functions to count the frequency of each word in the set of all documents retrieved as the query result. However, this is quite inefficient.
I was thinking that it would help me if I could get the list of words on which MarkLogic has performed indexing.
So is there a way to retrieve the list of indexed words from the universal index of MarkLogic?
Normally you would use something like this in MarkLogic:
(
(: requires an element range index on the element; cts:frequency reads each value's frequency from that lexicon :)
for $v in cts:element-values(xs:QName("myelem"))
let $f := cts:frequency($v)
order by $f descending
return $v
)[1 to 10]
This kind of functionality is built into the search:search library, which works very conveniently.
But unfortunately you cannot use that on values from cts:words and the like. There is a little trick that could get you close, though. Instead of using cts:frequency, you could use an xdmp:estimate on a cts:search to get a fragment count:
(
for $v in cts:words()
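(: xdmp:estimate is resolved entirely from the indexes, so this is an
   unfiltered fragment count rather than an exact document count :)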
let $f := xdmp:estimate(cts:search(collection(), $v))
order by $f descending
return $v
)[1 to 10]
The performance is lower, but still much faster than bluntly running through all documents.
HTH!
What if your search contains multiple terms? How will you calculate the order?
What if some of your terms are very common in your corpus of documents, and others are very rare? Should the count of "the" contribute more to the score than "protease", or should they contribute the same?
If the words occur in the title vs elsewhere in the document, should that matter?
What if one document is relatively short, and another is quite long. How do you account for that?
These are some of the basic questions that come up when trying to determine relevancy. Most search engines use a combination of term frequency (how often do the terms occur in your documents), and document frequency (how many documents contain the terms). They can also use the location of the terms in your documents to determine a score, and they can also account for document length in determining a score.
MarkLogic uses a combination of term frequency and document frequency to determine relevance by default. These factors (and others) are used to determine a relevance score for your search criteria, and this score is the default sorting for results returned by search:search from the search API or the low-level cts:search and its supporting operators.
You can look at the details of the options for cts:search to learn about some of the different scoring options. See 'score-logtfidf' and others here:
http://community.marklogic.com/pubs/5.0/apidocs/SearchBuiltins.html#cts:search
I would also look at the search developers guide:
http://community.marklogic.com/pubs/5.0/books/search-dev-guide.pdf
Many of the concepts are under consideration by the XQuery working group as enhancements for a future version of XQuery. They aren't part of the language today. MarkLogic has been at the forefront of search for a number of years, so you'll find there are many features in the product, and a lot of discussion related to this area in the archives.
"Is there a way to retrieve the list of indexed words from the universal index of marklogic?" No. The universal index is a hash index, so it contains hashes not words.
As noted by others you can create value-based lexicons that can list their contents. Some of these also include frequency information. However, I have another suggestion: cts:distinctive-terms() will identify the most distinctive terms from a sequence of nodes, which could be the current page of search results. You can control whether the output terms are just words, or include more complex terms such as element-word or phrase. See the docs for more details.
http://docs.marklogic.com/5.0doc/docapp.xqy#display.xqy?fname=http://pubs/5.0doc/apidoc/SearchBuiltins.xml&category=SearchBuiltins&function=cts:distinctive-terms
I have used cts:distinctive-terms(). It gives mostly wildcarded terms in my case, which are not of much use. Further, it is suitable for finding distinctive terms in a single document; when I try to run it on many documents it is quite slow.
What I want to implement is a dynamic facet which is populated with the keywords of the documents that come up in the search result. I have implemented it, but it is inefficient as it counts the frequency of all the words in the documents. I want it to be a suggestion or recommendation feature: if you have searched for this particular term or phrase, then you may be interested in these suggested terms or phrases. So I want an efficient method to find the terms which are common in the result set of documents of a search.
I tried cts:words() as suggested. It gives words similar to the search query word and the number of documents in which each is contained. What it does not take into account is the set of search result documents; it just shows the number of documents which contain similar words in the whole database, irrespective of whether these documents are present in the search result or not.
