I have just read the documentation for Firestore indexing, and I have the following questions to make sure that I understood the concept correctly:
Assume that I have the following data structure:
{
  "user_collection": {
    "user1_document": {
      "name": "Joe",
      "age": 21
    },
    "user2_document": {
      "name": "Sarah",
      "age": 29
    },
    "user3_document": {
      "name": "Sarah",
      "age": 24
    }
  }
}
If I now perform a query that returns every document with the name Sarah, Firestore looks through every index record of the field name and returns every document where the name value equals "Sarah". Did I understand that correctly?
My next question is a little more specific: indexes are sorted (in ascending and descending order). Now, when a query looks for every document where the user's age is smaller than 20, would Firestore start with the age 21, notice that the smallest age in the user collection is 21, and therefore stop checking any further documents, OR would Firestore still go through all the remaining documents? Generally, is there any information about what algorithm Firestore uses to search indexes, like binary search?
I know this information is irrelevant in terms of working with Firebase, but it just interests me.
If I now perform a query that returns every document with the name Sarah, Firestore looks through every index record of the field name and returns every document where the name value equals "Sarah". Did I understand that correctly?
Yes, and you'll have to pay a document read for each document the query returns. If, however, your query yields no results, then according to the official documentation regarding Firestore pricing:
Minimum charge for queries
There is a minimum charge of one document read for each query that you perform, even if the query returns no results.
So if, for example, you try to filter all users and get no results, you're still charged 1 read.
When a query is looking for every document where the user's age is smaller than 20, would Firestore start with the age 21, notice that the smallest age in the user collection is 21, and therefore stop checking any further document OR would Firestore still go through all the remaining documents?
No. When you're looking for every document where the user's age is less than 20, Firestore will return all documents where the age field holds a value that is less than 20. It would have returned documents where the field age holds a value of 20 if you were looking for every document where the user's age is less than or equal to 20.
Yes, in order to provide some results, Firestore will have to check all documents against a value.
Generally, is there any information about what algorithm Firestore uses to search indexes, like binary search?
I'm not aware of something public about the Firestore algorithm, but if I find something I will update my answer.
Please also note that in Firestore we are charged not only for the reads/writes/deletes we perform but also for storage. We have to pay for what we consume, including storage overhead. What does that mean? It means that we also pay for the metadata, automatic indexes, and composite indexes.
A single-field index can be considered, in short, as a value -> docId mapping. Per your database structure, a (sorted) index on the field name would look like this:
"Joe": "user1_id",
"Sarah": "user2_id",
"Sarah": "user3_id",
For an index on the field age, the (sorted) index structure would be:
"21": "user1_id",
"24": "user3_id",
"29": "user2_id",
When you run a query and an index supporting the same exists, it just has to read those index entries.
Every document where the user's age is smaller than 20, would Firestore start with the age 21,
In case of where("age", "<", 20) (and you have no document matching the query), there are no index entries for the same and hence no data is returned i.e. no other entries are read. However, it'll still cost you a read as Alex mentioned.
Additionally, if you want to query on both fields, you would need a composite index, e.g. { name: ASC, age: ASC }:
{"Joe", 21}: "user1_id",
{"Sarah", 24}: "user3_id",
{"Sarah", 29}: "user2_id"
Whenever you create a new document, all the related indexes are updated, so creating many indexes can generally slow down write operations. Databases (like MongoDB) typically use B-Trees for indexes. If you are curious about Firestore's internals, it might be a good idea to contact Firebase.
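To make the value -> docId picture concrete, here is a toy model in plain JavaScript of a sorted index and a binary-search range scan. This is purely illustrative: Firestore's actual index format and search algorithm are not public, and the entry values are taken from the example above.

```javascript
// Illustrative model only: a single-field index as a sorted list of
// (value, docId) entries, which makes range scans cheap.
const ageIndex = [
  { value: 21, docId: "user1_id" },
  { value: 24, docId: "user3_id" },
  { value: 29, docId: "user2_id" },
];

// Binary search for the first entry with value >= target (lower bound).
function lowerBound(index, target) {
  let lo = 0, hi = index.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (index[mid].value < target) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// where("age", "<", 20): scan only the entries before the lower bound of 20.
function queryLessThan(index, target) {
  return index.slice(0, lowerBound(index, target)).map(e => e.docId);
}

console.log(queryLessThan(ageIndex, 20)); // [] — no entries read
console.log(queryLessThan(ageIndex, 25)); // ["user1_id", "user3_id"]
```

With a structure like this, a query with no matches touches no index entries at all, which is consistent with the "no data is returned" behavior described above.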
Related
I have a collection where the documents are uniquely identified by a date, and I want to get the n most recent documents. My first thought was to use the date as a document ID, and then my query would sort by ID in descending order. Something like .orderBy(FieldPath.documentId, descending: true).limit(n). This does not work, because it requires an index, which can't be created because __name__ only indexes are not supported.
My next attempt was to use .limitToLast(n) with the default sort, which is documented here.
By default, Cloud Firestore retrieves all documents that satisfy the query in ascending order by document ID
According to that snippet from the docs, .limitToLast(n) should work. However, because I didn't specify a sort, it says I can't limit the results. To fix this, I tried .orderBy(FieldPath.documentId).limitToLast(n), which should be equivalent. This, for some reason, gives me an error saying I need an index. I can't create it for the same reason I couldn't create the previous one, but I don't think I should need to because they must already have an index like that in order to implement the default ordering.
Should I just give up and copy the document ID into the document as a field, so I can sort that way? I know it should be easy from an algorithms perspective to do what I'm trying to do, but I haven't been able to figure out how to do it using the API. Am I missing something?
Edit: I didn't realize this was important, but I'm using the flutterfire firestore library.
A few points. It is ALWAYS a good practice to use random, well-distributed documentIds in Firestore, for scale and efficiency. Related to that, there is effectively NO WAY to query by documentId - in the few circumstances where you can use it, a range query is possible but VERY tricky, as it requires inequalities, and you can only use inequalities on one field. IF there's a reason to search on an ID, yes, it is PERFECTLY appropriate to store it in the document as well - in fact, my wrapper library always does this.
The correct notation, btw, would be FieldPath.documentId() (a method, not a constant) - alternatively, __name__ - but I believe this only works in queries. The reason it requested a new index is that without the (), it assumed you had a field named FieldPath with a subfield named documentId.
Further: FieldPath.documentId() does NOT generate the documentId at the server - it generates the FULL PATH to the document - see Firestore collection group query on documentId for a more complete explanation.
So net:
=> documentIds should be as random as possible within a collection; it's generally best to let Firestore generate them for you.
=> a valid exception is when you have ONE AND ONLY ONE sub-document under another - for example, every "user" document might have one and only one "forms of Id" document as a subcollection. It is valid to use the SAME ID as the parent document in this exceptional case.
=> anything you want to query should be a FIELD in a document, and generally a simple field.
=> WORD TO THE WISE: Firestore "arrays" are ABSOLUTELY NOT ARRAYS. They are ORDERED LISTS, generally in the order they were added. The SDK presents them to the CLIENT as arrays, but Firestore itself does not STORE them as ACTUAL ARRAYS - THE NUMBER YOU SEE IN THE CONSOLE is the order, not an index. Matching elements in an array (arrayContains, e.g.) requires matching the WHOLE element - if you store an ordered list of objects, you CANNOT query the "array" on sub-elements.
From what I've found:
FieldPath.documentId does not match on the documentId, but on the refPath (which it gets automatically if passed a document reference).
As such, since the documents are to be sorted by date, it would be more ideal to store a timestamp field value such as createdAt rather than a human-readable string, which is prone to sorting by character rather than by the actual date value.
From there, you can simply order by date and limit to last. You can keep the document IDs as you intended.
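To illustrate the sorting pitfall mentioned above, here is a small plain-JavaScript sketch (the date values are made up):

```javascript
// Sketch: why lexicographic sorting of human-readable dates can go wrong.
// Non-padded date strings sort by character, not by value:
const sloppy = ["2023-9-02", "2023-10-01"].sort();
console.log(sloppy); // ["2023-10-01", "2023-9-02"] — wrong order!

// Zero-padded ISO 8601 strings do sort correctly...
const iso = ["2023-09-02", "2023-10-01"].sort();
console.log(iso); // ["2023-09-02", "2023-10-01"]

// ...but a numeric timestamp (e.g. a createdAt field) avoids the
// problem entirely and maps directly onto orderBy("createdAt"):
const docs = [
  { id: "b", createdAt: 1696118400000 },
  { id: "a", createdAt: 1693612800000 },
];
docs.sort((x, y) => x.createdAt - y.createdAt);
console.log(docs.map(d => d.id)); // ["a", "b"]
```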
After reading through the Firestore documentation about indexing, I want to confirm that this is what a single-field index would look like:
Let's say I have a Firestore collection cars with the following documents:
car123: {
  brand: "Mercedes",
  model: "W123"
},
car423: {
  brand: "BMW",
  model: "x5"
},
carXyZ: {
  brand: "Mercedes",
  model: "S 500"
}
Would the indexes that Firestore creates automatically really look more or less like this?
index for queries filtering by brand equals "Mercedes" = ["car123", "carXyZ"]
index for queries filtering by brand equals "BMW" = ["car423"]
index for queries filtering by model equals "S 500" = ["carXyZ"]
...and does this mean that each time a car is added, n indexes are updated, where n is the number of fields that car has, with one ASC and one DESC version of each index?
The examples you provided are an oversimplification of what an actual index looks like. According to the documentation:
A single-field index stores a sorted mapping of all the documents in a
collection that contain a specific field. Each entry in a single-field
index records a document's value for a specific field and the location
of the document in the database.
So according to the excerpt cited, an index entry has at least four things: a mapping of all the documents that contain the indexed field, the indexed field itself, each document's value for that field, and the location of the document in the database. What that structure looks like internally is not publicly documented, as far as I know.
Regarding your question: it's possible to infer that each time a new document is created with a field that is part of a single-field index, an entry is added to two single-field index structures, the descending one and the ascending one.
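As a rough illustration of that inferred structure (again, the real layout is not public), a plain-JavaScript sketch of building the ascending and descending entries for one field might look like this, using the cars example above:

```javascript
// Illustrative sketch only: each index entry records a field value and
// the document's location, and entries are kept sorted by value.
const cars = {
  car123: { brand: "Mercedes", model: "W123" },
  car423: { brand: "BMW", model: "x5" },
  carXyZ: { brand: "Mercedes", model: "S 500" },
};

// Build ascending and descending single-field indexes for one field.
function buildIndexes(docs, field) {
  const entries = Object.entries(docs)
    .filter(([, doc]) => field in doc)
    .map(([id, doc]) => ({ value: doc[field], location: id }));
  const asc = [...entries].sort((a, b) => a.value.localeCompare(b.value));
  const desc = [...asc].reverse();
  return { asc, desc };
}

const { asc } = buildIndexes(cars, "brand");
console.log(asc.map(e => `${e.value} -> ${e.location}`));
// ["BMW -> car423", "Mercedes -> car123", "Mercedes -> carXyZ"]
```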
I am working on a project that when a user cancels their plan, all their documents should be updated to deactivated except for a pre-defined number of documents that are allowed to stay active. The pre-defined number amount determines the projects allowed to stay active along with the date they were created.
For example, if customer A has 1,000 documents and cancels their plan, all their documents except for the first 100 created should be updated to be deactivated.
My first attempt was to get all document ids with .listDocuments() but I noticed the created date is not part of Firestore's DocumentReference. Therefore I can't exclude the pre-defined number of documents allowed to stay active.
I could use .get() and use the created value, but I'm afraid that getting all the documents at once (which could be a million) would cause my cloud function to run out of memory, even if I have it set to the maximum allowed configuration.
Another option that I thought of, I could use .listDocuments() and write each document id to a temp collection in Firestore, which could kick off a cloud function for each document. This function would only have to work with one document, so it wouldn't have to worry about running out of resources. I am unsure how to determine if the document I'm working on should be marked as deactivated or is allowed to stay active.
I am not that worried about the reads and writes, as this workflow should not happen very often. Any help would be appreciated.
Thank you
One possible approach would be to mark the documents to be excluded.
I don't know your exact algorithm, but if you want to mark the first 100 documents that were created in a collection, you can use a Cloud Function that runs for each new document and checks if there are already 100 docs in the collection.
If not, you update a field in this new document with its rank (using a Transaction to get the highest existing rank and increment it). If there are already 100 documents previously created in the collection, you just update the field to 0, for example, in such a way that later on you can query with where("rank", "==", 0).
Then, when you want to delete all the docs but the 100 first ones, just use where("rank", "==", 0) query.
So, concretely:
The first doc is created: you set the rank field to 1.
The Nth doc (N != 1) is created: you fetch all the docs with a query ordered by rank and limited to 1 doc (collecRef.orderBy("rank", "desc").limit(1)) in a Transaction. Since you are in a Cloud Function, you can use a Query in the Transaction (which you cannot with the Client SDKs). Then, still in the Transaction:
If the value of rank for the single doc returned by the query is < 100, you set the rank value of the newly created doc to [single doc value + 1]
If the value of rank for the single doc returned by the query is = 100, you set the rank value to 0
If I didn't make any mistake (I didn't test it! :-)), you end up with 100 docs with a rank value between 1 and 100 (the 100 first-created docs) and the rest of the docs with a rank value of 0.
Then, as said above you can use the where("rank", "==", 0) query to select all the docs to be deleted.
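A plain-JavaScript model of the ranking logic above (the real version would run inside a Cloud Function transaction; the KEEP constant and field names here are illustrative):

```javascript
// Plain-JS model of the ranking idea: the first KEEP docs get ranks
// 1..KEEP, every later doc gets rank 0 and can be selected for
// deactivation with where("rank", "==", 0).
const KEEP = 100; // number of documents allowed to stay active

// Assign a rank to a newly created doc given the current collection
// (models the Transaction that reads the highest existing rank).
function assignRank(collection) {
  const highest = collection.reduce((m, d) => Math.max(m, d.rank), 0);
  return highest < KEEP ? highest + 1 : 0;
}

// Simulate creating 250 documents.
const collection = [];
for (let i = 0; i < 250; i++) {
  collection.push({ id: `doc${i}`, rank: assignRank(collection) });
}

const ranked = collection.filter(d => d.rank > 0);
const toDeactivate = collection.filter(d => d.rank === 0);
console.log(ranked.length, toDeactivate.length); // 100 150
```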
I was looking for a solution to Firestore's limitation on sequential indexed fields, which means the following, from this doc:
"Sequential indexed fields" means any collection of documents that
contains a monotonically increasing or decreasing indexed field. In
many cases, this means a timestamp field, but any monotonically
increasing or decreasing field value can trigger the write limit of
500 writes per second.
As per the solution, I can add a shard field to my collection containing a random value, and create a composite index with the timestamp. I am trying to achieve this with the existing fields I have in my document.
My document has the following fields:
{
users: string[],
createdDate: Firebase Timestamp
....
}
I already have a composite index created: users Arrays createdDate Descending. Also, I have created exemptions for the fields field in the automatic index settings. The users field will contain a list of Firebase auto-generated IDs, so it is definitely random. Now I am not sure whether the users field will do the job of the shard field from the example doc. That way, we can avoid adding a new field and still increase the write rate. Can someone please help me with this?
While I don't have specific experience that says what you're trying to do definitely will or will not work the way you expect, I would assume that it works, based on the fact that the documentation says (emphasis mine):
Add a shard field alongside the timestamp field. Use 1..n distinct values for the shard field. This raises the write limit for the collection to 500*n, but you must aggregate n queries.
If each users array contains different and essentially random user IDs, then the array field values would be considered "distinct" (as two arrays are only equal if their elements are all equal to each other), and therefore suitable for sharding.
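For reference, a plain-JavaScript sketch of the documented sharding pattern with an explicit shard field (all names here are illustrative; this models the n queries rather than calling the SDK):

```javascript
// Sketch, assuming the documented pattern: n distinct shard values raise
// the write limit to 500*n, at the cost of issuing and merging n queries.
const NUM_SHARDS = 3;

// When writing, pick a random shard value for the new document.
function pickShard() {
  return Math.floor(Math.random() * NUM_SHARDS);
}

// When reading, run one (shard == i) query per shard and merge by timestamp.
function queryAllShards(docs) {
  const perShard = [];
  for (let i = 0; i < NUM_SHARDS; i++) {
    perShard.push(docs.filter(d => d.shard === i)); // one query per shard
  }
  return perShard.flat().sort((a, b) => b.createdDate - a.createdDate);
}

const docs = [
  { id: "a", shard: 0, createdDate: 3 },
  { id: "b", shard: 2, createdDate: 1 },
  { id: "c", shard: 1, createdDate: 2 },
];
console.log(queryAllShards(docs).map(d => d.id)); // ["a", "c", "b"]
```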
Right now I have a products collection where I store my products as documents like the following:
documentID:
title: STRING,
price: NUMBER,
images: ARRAY OF OBJECTS,
userImages: ARRAY OF OBJECTS,
thumbnail: STRING,
category: STRING
NOTE: My web app has approximately 1000 products.
I'm thinking about doing full-text search on the client side while also saving on database reads, so I'm considering duplicating my data in Firestore: saving a partial copy of all of my products in a single document and sending that to the client, so I can implement client-side full-text search with it.
I would create the allProducts collection, with a single document with 1000 fields. Is this possible?
allProducts: collection
Contains a single document with the following fields:
Every field would contain a MAP (object) with product details.
document_1_ID: { // Same ID as the 'products' collection
title: STRING,
price: NUMBER,
category: STRING,
thumbnail
},
document_2_ID: {
title: STRING,
price: NUMBER,
category: STRING,
thumbnail
},
// AND SO ON...
NOTE: I would still keep the products collection intact.
QUESTION
Is it possible to have a single document with 1000 fields? What is the limit?
I'm looking into this because, since I'm performing client-side full-text search, every user would need access to my whole database of products. And I don't want every user to read every single document I have, because I imagine the costs of that would not scale very well.
NOTE2: I know that the maximum size for a document is 1mb.
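For reference, the kind of client-side search I have in mind over that single document would look something like this (product data made up for illustration):

```javascript
// Sketch of client-side search over one "allProducts" document whose
// fields map product IDs to partial product summaries.
const allProducts = {
  prod1: { title: "Red Chair", price: 49, category: "furniture" },
  prod2: { title: "Blue Table", price: 120, category: "furniture" },
  prod3: { title: "Desk Lamp", price: 25, category: "lighting" },
};

// Simple case-insensitive substring search over titles, done entirely
// on the client after a single document read.
function searchProducts(products, term) {
  const q = term.toLowerCase();
  return Object.entries(products)
    .filter(([, p]) => p.title.toLowerCase().includes(q))
    .map(([id]) => id);
}

console.log(searchProducts(allProducts, "table")); // ["prod2"]
```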
According to this document, in addition to the 1MB limit per document, there is a limit of index entries per document, which is 40,000. Because each field appears in 2 indexes (ascending and descending), the maximum number of fields is 20,000.
I made a Node.js program to test it and I can confirm that I can create 20,000 fields but I cannot create 20,001.
If you try to set more than 20,000 fields, you will get the exception:
INVALID_ARGUMENT: too many index entries for entity
// Requires the firebase-admin SDK, initialized with valid credentials,
// and must run in a context that allows top-level await.
const admin = require("firebase-admin");
admin.initializeApp();
const db = admin.firestore();

// Setting 20001 here throws "INVALID_ARGUMENT: too many index entries for entity"
const indexPage = Array.from(Array(20000).keys()).reduce((acc, cur) => {
  acc[`index-${cur}`] = cur;
  return acc;
}, {});
await db.doc(`test/doc`).set(indexPage);
I would create the allProducts collection, with a single document with 1000 fields. Is this possible?
There isn't quite a fixed limitation for that. However, the documentation recommends having fewer than 100 fields per document:
Limit the number of fields per document: 100
So the problem isn't that you duplicate data; the problem is that documents have other limitations you should care about. You're also limited in how much data you can put into a single document. According to the official documentation regarding usage and limits:
Maximum size for a document: 1 MiB (1,048,576 bytes)
As you can see, you are limited to 1 MiB of data in total in a single document. When it comes to storing text, that's quite a lot, but as your documents get bigger, be careful about this limitation.
If you store a large amount of data in your documents and those documents should be updated by many admins, there is another limitation to take care of: you are limited to 1 write per second on any single document. So if the admins all try to write/update products in that same document at once, you might start to see some of these writes fail. Be careful about this limitation too.
And the last limitation is on index entries per document. If you decide to get past the first limitation, please note that the maximum is 40,000 index entries. Because each field has two associated indexes (ascending and descending), the maximum number of fields is 20,000.
Is it possible to have a single document with 1000 fields?
It is possible, up to 20,000 fields (40,000 index entries), but in your case with no benefit. I say that because every time you perform a query (get the document), the entire single document is returned. There is no way to have a query search inside a single document and return only matching Product objects.
And I don't want every user to read every single document that I have, because I imagine that the costs of that would not scale very well.
Downloading an entire collection to search for fields client-side isn't practical at all and is also very costly. That's the reason why the official documentation recommends a third-party search service like Algolia.
For Android, please see my answer in the following post:
Is it possible to use Algolia query in FirestoreRecyclerOptions?
Firebase has a limit of 20k fields per document.
https://www.youtube.com/watch?v=o7d5Zeic63s
According to the documentation, there is no stated limit placed on the number of fields in a document. However, a document can only have up to 40,000 index entries, which will grow as documents contain more fields that are indexed by default.