How to query by nearest neighbor on Cloud Firestore?

I'm quite new to the Firebase ecosystem and I'm wondering if there's a way to run smart queries over N-dimensional feature vectors.
We're trying to deploy a face-recognition application which, after computing a face's encoding (basically a vector of 128 features), tests it against the database (such as Cloud Firestore) to find the closest match. From what I've understood, the same task is usually achieved with PostgreSQL, Apache Solr, etc. by indexing the 128 fields and using a cube operator or a Euclidean-distance-style query, with quite reasonable timings. I think there's already something similar for geo-location queries (GeoFire).
Is there a way, or some alternative option, to perform this kind of task?

Firestore queries can only perform comparisons along a single axis. There is no built-in way for them to perform comparisons on multiple values, which is why GeoFire has to use geohash values to emulate multi-axis comparisons.
That is also your only option when it comes to other multi-value comparisons: package the values into a single value in a way that allows comparing those packed values in the way you need.
If your use-case also requires comparing two scalar values and then getting a range within the resulting two-dimensional space, you can probably use a similar scheme of packing the two values into a string bit by bit. You might even be able to use GeoFire as a starting point, although you'll need to modify it to drop the assumption that the points lie on a sphere.
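For illustration, here's a minimal sketch of that bit-by-bit packing for two non-negative integers (a Z-order / Morton key, the same trick geohashes use). The 16-bit precision and the quantization of your values to integers are assumptions, and a single key range only approximates two-axis proximity:

def interleave(x: int, y: int, bits: int = 16) -> str:
    # Interleave the bits of x and y (most significant first) so that
    # the lexicographic order of the resulting keys follows a Z-order
    # curve; a range scan on one string field then approximates a
    # two-axis proximity query, like a geohash does.
    packed = 0
    for i in range(bits - 1, -1, -1):
        packed = (packed << 1) | ((x >> i) & 1)
        packed = (packed << 1) | ((y >> i) & 1)
    # Fixed-width hex keeps string comparison consistent with numeric order.
    return format(packed, "0%dx" % (bits // 2))

print(interleave(3, 5))   # nearby (x, y) pairs yield nearby keys
print(interleave(3, 6))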
If that doesn't work for you, I recommend looking at one of the solutions that natively supports your use-case.

Related

How to model data in dynamodb if your access pattern includes many WHERE conditions

I am a bit confused if this is possible in DynamoDB.
I will give an example of SQL and explain how the query could be optimized and then I will try to explain why I am confused on how to model this and how to access the same data in DynamoDB.
This is not company code. Just an example I made up based on pcpartpicker filter.
SELECT * FROM BUILDS
WHERE CPU = 'Intel'
  AND OVERCLOCKED = true
  AND Price < 3000
  AND GPU = 'GeForce RTX 3060'
  AND ...
From my understanding, SQL will first scan the BUILDS table and keep only the builds whose CPU is Intel. From this subset, it then applies the next WHERE clause to filter on OVERCLOCKED = true, and so on. Basically, each additional WHERE clause has a smaller number of rows to filter.
One thing we can do to speed up this query is to create an index on these columns. The main performance gain comes from avoiding the initial scan of the whole table for the first clause the database evaluates. So in the example above, instead of scanning the whole table to find builds that use Intel, it can retrieve them quickly because they are indexed.
How would you model this data in DynamoDB? I know you can create a bunch of secondary indexes, but instead of the engine applying one WHERE clause and passing the result along for the next round of filtering, it seems like you would have to do all of this yourself. For example, we would need to use a secondary index to find all the builds that use Intel, all that are overclocked, all that cost less than 3000, and all that use a specific GPU, and then compute the intersection ourselves. Is there a better way to map out this access pattern? I am having a hard time figuring out if this is even possible.
EDIT:
I know I could also just use a normal filter, but that seems like it would be pretty expensive, since it basically brute-force searches through the table, similar to the SQL solution without indexing.
To see what I mean from pcpartpicker here is the link to the site with this page: https://pcpartpicker.com/builds/
People basically select multiple filters so it makes designing for access patterns even harder.
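To make that concrete, here's roughly what the "query one index, filter the rest" approach would look like with boto3 (the table, index, and attribute names are made up):

import boto3
from boto3.dynamodb.conditions import Attr, Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Builds")  # hypothetical table with a GSI on cpu

# Query the most selective index, then let DynamoDB apply the remaining
# predicates as a FilterExpression. Filtered-out items still consume
# read capacity, which is why this resembles an unindexed SQL scan in cost.
resp = table.query(
    IndexName="cpu-index",
    KeyConditionExpression=Key("cpu").eq("Intel"),
    FilterExpression=(
        Attr("overclocked").eq(True)
        & Attr("price").lt(3000)
        & Attr("gpu").eq("GeForce RTX 3060")
    ),
)
print(resp["Items"])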
I'd highly recommend going through the various AWS presentations on YouTube...
In particular here's a link to The Iron Triangle of Purpose - PIE Theorem chapter of the AWS re:Invent 2018: Building with AWS Databases: Match Your Workload to the Right Database (DAT301) presentation.
DynamoDB provides IE - Infinite Scale and Efficiency.
But you need P - Pattern Flexibility.
You'll need to decide if you need PI or PE.

Is there a workaround for the Firebase Query "NOT-IN" Limit to 10?

I saw a similar question here: Is there a workaround for the Firebase Query "IN" Limit to 10?
The point now is: with the in query the union works, but with the not-in query I get the intersection, which gives me all the documents. Does anyone know how to do this?
As @samthecodingman mentioned, it's hard to provide specific advice without examples/code, but I've had to deal with this a few times and there are a few generalized strategies you can take:
Restructure your data - You can use up to 100 equality operators, so one possible approach is to store your filters/tags as a map, for example:
{
  id: 1234567890,
  ...
  filters: {
    filter1: true,
    filter2: true,
    filter3: true
  }
}
If a doc doesn't have a particular tag, you could simply omit it, or you could set it to false, depending on your use case.
Note, however, that you may need to create composite indexes if you want to combine equality operators with inequality operators (see the docs). If you have too many filters, this will get unwieldy quickly.
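A minimal sketch of filtering on such a map with the Firestore Python client (the collection and field names here are hypothetical):

from google.cloud import firestore
from google.cloud.firestore_v1.base_query import FieldFilter

db = firestore.Client()

# Each dotted map path counts as one equality operator
# against the 100-operator limit.
query = (
    db.collection("items")
    .where(filter=FieldFilter("filters.filter1", "==", True))
    .where(filter=FieldFilter("filters.filter2", "==", True))
)
for doc in query.stream():
    print(doc.id, doc.to_dict())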
Query everything and cache locally - As you mentioned, fetching all the data repeatedly can get expensive. But if it doesn't change too often or it isn't critical to get the changes in real time, you can cache it locally and refresh at some interval (hourly or daily, for example).
Implement Full-Text Search - If neither of the previous options will work for you, you can always implement full-text search using one of the services Firebase recommends like Elastic. These are typically far more efficient for use-cases with a high number of tags/filters, but obviously there's an upfront time cost for setup and potentially an ongoing monetary cost if your usage is higher than the free tiers these services offer.
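For instance, a not-in over many values maps naturally onto a bool/must_not query in Elasticsearch; here's a minimal sketch with the Python client (the index name and tags field are hypothetical):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Exclude every document tagged with any of the listed values;
# the list can be far longer than Firestore's not-in limit of 10.
resp = es.search(
    index="items",
    query={"bool": {"must_not": [{"terms": {"tags": ["tag1", "tag2", "tag3"]}}]}},
)
for hit in resp["hits"]["hits"]:
    print(hit["_id"], hit["_source"])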

Vector based search in solr

I am trying to implement dense-vector-based search in Solr (currently using version 8.5.2). My requirement is to store a dense vector representation for each document in Solr, in a field called vectorForm.
Now when a user issues a query, I convert that query to a dense vector representation as well, and I want to get the top 100 documents from Solr that have the highest dotProduct value between the query vector and the vectorForm field stored for each document.
A few questions I have around this:
What field type should be used to define the vectorForm field (does docValues with multiValued integers work best here)?
How do I efficiently do the above vector based retrieval? (keeping in mind that latency should be as low as possible)
I read that Solr has dotProduct and cosineSimilarity functions, but I'm not able to understand how to use them in my case; if that's the solution, any link to an example implementation would help.
Any help or guidance will be a huge help for me.
You can use "dense vector search" starting with Solr 9.0.
https://solr.apache.org/guide/solr/9_0/query-guide/dense-vector-search.html
Neural Search has been released with Apache Solr 9.0.
The DenseVectorField gives the possibility of indexing and searching dense vectors of float elements, defining parameters such as the dimension of the dense vector to pass in, the similarity function to use, the knn algorithm to use, etc...
Currently, it is still necessary to produce the vectors externally and then push the obtained embeddings into Solr.
At query time you can use the k-nearest neighbors (knn) query parser that allows finding the k-nearest documents to the query vector according to indexed dense vectors in the given field.
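A minimal sketch of such a knn query through the Python requests library (the core name is hypothetical, and the schema is assumed to define vectorForm as a DenseVectorField with vectorDimension=128):

import json
import requests

query_vector = [0.0] * 128  # replace with the real query embedding

# {!knn} finds the topK documents nearest to the given vector
# according to the similarity function declared in the schema.
params = {
    "q": "{!knn f=vectorForm topK=100}" + json.dumps(query_vector),
    "fl": "id,score",
}
resp = requests.get("http://localhost:8983/solr/mycore/select", params=params)
resp.raise_for_status()
for doc in resp.json()["response"]["docs"]:
    print(doc["id"], doc["score"])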
Here is our End-to-End Vector Search Tutorial, which can definitely help you understand how to leverage this new Solr feature to improve the user search experience:
https://sease.io/2023/01/apache-solr-neural-search-tutorial.html

Reduce numbers of queries in OpenTSDB

I use OpenTSDB to store my time-series data. For each incoming data point, I need to fetch the 20 data points that precede it. But I have a large number of metrics, and I can't call the OpenTSDB query API too many times. What can I do to reduce the number of queries to OpenTSDB?
As far as I know you can't aggregate different metrics into one single result. But I would suggest two solutions:
You can put multiple metric queries in one call. If you use the HTTP API endpoint you can do something like this:
http://otsdb:4242/api/query?start=15m-ago&m=avg:metric1{tag1=a}&m=avg:metric2{tag2=b}
You get the results for all queries with the same start (and end) dates/times. But with multiple metrics, don't forget that the call will take longer...
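The same multi-metric call, sketched as a POST to /api/query with the Python requests library (host and metric names are taken from the example above):

import requests

# One POST carries several sub-queries, so N metrics cost a
# single HTTP round trip instead of N separate calls.
payload = {
    "start": "15m-ago",
    "queries": [
        {"aggregator": "avg", "metric": "metric1", "tags": {"tag1": "a"}},
        {"aggregator": "avg", "metric": "metric2", "tags": {"tag2": "b"}},
    ],
}
resp = requests.post("http://otsdb:4242/api/query", json=payload)
resp.raise_for_status()
for series in resp.json():
    print(series["metric"], len(series["dps"]), "data points")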
Redefine your time series. I don't know any details about your data, but if you're going to store and use data you should also think about usage: What queries am I going to run? How often? What would be the most common access to the data? And so on...
That's also what's advised in the OpenTSDB documentation [1]:
Cardinality also affects query speed a great deal, so consider the queries you will be performing frequently and optimize your naming schema for those.
So, I would suggest using tags to overcome this issue of multiple metrics. As I mentioned, I don't know your schema, but OpenTSDB is much more powerful with tags; there are many examples and also filtering options.
Edit 1:
Since OpenTSDB 2.3 there is also an expression API: http://opentsdb.net/docs/build/html/api_http/query/exp.html
You should be able to handle multiple metric queries together with it (though I've never used it myself).

Firebase/GeoFire - Most popular item at location

I am currently in the evaluation process for a database that should serve as a backend for a mobile application.
Right now I am looking at Firebase, and for now I like it really much.
It is a requirement to be able to fetch the most popular items at a certain location (possibly in the future: additionally for a certain time range, which would be an attribute of the item) from the database.
So naturally I stumbled upon GeoFire, which provides location-based query possibilities for Firebase.
Unfortunately, at least as far as I understood, there is no way to order the results by an attribute other than the distance (correct me if I am wrong).
So what do I do if I am not interested in the distance (I only want items within a certain radius, no matter how far from the center) but in the popularity factor (for the sake of simplicity, a simple number that symbolizes popularity)?
IMPORTANT:
Filtering/sorting on the client side is not an option (or at least the least preferred one), as the result set could potentially grow without bound.
The first version of the application will be for Android, so the Firebase Java client library would be used in the first step.
Are there possibilities to solve this or is Firebase out of the race and not the right candidate for the job?
There is no way to add an extra condition to the server-side query of GeoFire.
The query model of the Firebase database allows filtering on only a single property. GeoFire already performs the seemingly impossible feat of filtering on both latitude and longitude. It does this through the magic of geohashes, which combine latitude and longitude into a single string.
Strictly speaking, you could find a way to extend the magic of geohashes to add a third number into the mix. But while possible, I doubt it's feasible for most of us.
