I am considering storing multiple tenants in a single Firebase Firestore database. There will only be one collection per tenant and a few shared collections. Some will have more data than others. Some tenants may have a few million records while others may end up with a few billion. I want to confirm that the size of data in one collection will not impact the performance or storage of another collection in the same database.
I couldn't find much in the documentation about how the data is physically stored. Is all the data in Firestore stored in a single blob/file? If so, this could be a problem when there are hundreds of tenants with billions of records each. In an ideal world, each collection would be a physically separate file, and the server orchestration would separate the collections onto multiple servers so that a single server is not sharing the load between a very heavy tenant, and a very light tenant. This scenario would mean that a heavy tenant would slow down a light tenant.
My basic question is: can a single Firestore database infinitely scale up in size assuming that no single collection is bigger than a few billion records?
I know that there are two types of databases: native and datastore. Which of these seems more appropriate, and is the answer to my question different depending on which of these I select?
If the answer is that Firestore cannot scale infinitely in this way, what is the alternative approach? Should I be using Bigtable instead? Cassandra? Or, is there another way to physically divide my Firestore database other than collections?
Some tenants may have a few million records while others may end up with a few billion. I want to confirm that the size of data in one collection will not impact the performance or storage of another collection in the same database.
The performance in Firestore isn't related to the number of documents that exist in a collection. In terms of speed, it doesn't matter if you perform a query on:
A top-level (root-level) collection.
A sub-collection, which basically represents a collection that is nested under a document.
A collection group, which actually means querying collections and sub-collections that exist across the entire database.
The speed will always be the same, as long as the query returns the same number of documents. This is because query performance depends on the number of documents you request, not on the number of documents you search. So it doesn't really matter whether you query a collection with 1 million documents or 1 billion documents; the time to get the same results will be the same.
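One way to picture why result size matters more than collection size is a query served from a sorted index. The sketch below is only an illustration: it assumes an in-memory `TreeMap` as a stand-in for an index on a timestamp field, the class and method names are made up for this example, and nothing here describes how Firestore is actually implemented.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.TreeMap;

// Illustration only: a sorted map plays the role of an index (timestamp -> document id).
class IndexedQuerySketch {
    static TreeMap<Long, String> buildIndex(int documentCount) {
        TreeMap<Long, String> index = new TreeMap<>();
        for (long i = 0; i < documentCount; i++) {
            index.put(i, "doc" + i);
        }
        return index;
    }

    // Returns the ids of the `limit` newest documents by walking the index
    // from its high end. Only `limit` entries are read, however large the index is.
    static List<String> newest(TreeMap<Long, String> index, int limit) {
        List<String> ids = new ArrayList<>();
        Iterator<String> it = index.descendingMap().values().iterator();
        while (ids.size() < limit && it.hasNext()) {
            ids.add(it.next());
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(newest(buildIndex(1_000), 3));   // [doc999, doc998, doc997]
        System.out.println(newest(buildIndex(200_000), 3)); // [doc199999, doc199998, doc199997]
    }
}
```

Both calls read exactly three index entries, even though the second index is 200 times larger.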
I couldn't find much in the documentation about how the data is physically stored. Is all the data in Firestore stored in a single blob/file? If so, this could be a problem when there are hundreds of tenants with billions of records each.
In Cloud Firestore, the unit of storage is the document. Documents live in collections, which are simply containers for documents. Please note that Firestore is optimized for storing large collections of small documents. And when I say large, I mean extremely large. So when you perform a query against a collection of 1 MILLION documents, the speed depends on the number of results you return and it does not depend on the number of the documents in which you search, or on the number of documents that exist in other collections in which you aren't performing a search.
Can a single Firestore database infinitely scale up in size assuming that no single collection is bigger than a few billion records?
With the Firebase Realtime Database you had to scale using multiple databases; in Firestore this practice is not necessary. However, there are some techniques that are very well explained in the official docs:
Building scalable applications with Firestore
If the answer is that Firestore cannot scale infinitely in this way, what is the alternative approach?
It can definitely massively scale.
See the Firestore best practices and security rules.
You may conceptualize Firestore as one service shared by all of Google's customers. Just as Google attempts to ensure that one customer (a so-called "noisy neighbor") does not affect the service for others, you don't want to be a noisy neighbor to yourself.
You need to consider more than just performance.
Security. For example, see security rules as a mechanism that you may be able to use to help enforce segregation of your tenants' data. You will want to understand fully how to keep different customers' data separated securely. Your customers will want to understand what measures you're employing to ensure their data is kept separate too.
Multitenancy. Google Cloud Platform has no intrinsic (platform-wide) multitenant capabilities and, often, a way to manifest tenancy has been to use different Google Projects for different customers. This is because Projects provide a well-defined security perimeter. You may want to investigate whether some subset of your customers would benefit from a one-customer, one-project arrangement.
Quota. Another important consideration is quota. Every Cloud Platform method is constrained by some quota. You will want to distribute quota fairly across customers so that some customers don't consume all of it, denying other customers access to the service.
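The quota point can be sketched as a per-tenant limiter that sits in your own application code, in front of the database calls. This is a minimal fixed-window counter under stated assumptions: `TenantQuota`, its limits, and the tenant names are hypothetical, and it is not part of any Firestore or Google Cloud API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical application-level limiter; not part of any Firestore or Google Cloud API.
class TenantQuota {
    private final int maxOpsPerWindow;
    private final long windowMillis;
    // tenantId -> {start of current window in ms, operations used in that window}
    private final Map<String, long[]> usage = new HashMap<>();

    TenantQuota(int maxOpsPerWindow, long windowMillis) {
        this.maxOpsPerWindow = maxOpsPerWindow;
        this.windowMillis = windowMillis;
    }

    // Returns true if the tenant may perform one more operation at time nowMillis.
    synchronized boolean tryAcquire(String tenantId, long nowMillis) {
        long[] window = usage.computeIfAbsent(tenantId, id -> new long[] {nowMillis, 0});
        if (nowMillis - window[0] >= windowMillis) {
            window[0] = nowMillis; // start a fresh window
            window[1] = 0;
        }
        if (window[1] >= maxOpsPerWindow) {
            return false; // this tenant has used its share; other tenants are unaffected
        }
        window[1]++;
        return true;
    }

    public static void main(String[] args) {
        TenantQuota quota = new TenantQuota(2, 1_000);
        System.out.println(quota.tryAcquire("heavy", 0));     // true
        System.out.println(quota.tryAcquire("heavy", 10));    // true
        System.out.println(quota.tryAcquire("heavy", 20));    // false
        System.out.println(quota.tryAcquire("light", 20));    // true
        System.out.println(quota.tryAcquire("heavy", 1_000)); // true
    }
}
```

The clock is passed in as a parameter so the behaviour is easy to test; in a real service you would likely pass `System.currentTimeMillis()`.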
I'm currently brainstorming and wondering if it's possible to easily communicate among multiple Firestore databases. If so, I could isolate collections and therefore also isolate writes/updates on those collections from competing with other services, reducing the risk that I hit the limit of 10,000 writes per second on a given database.
Conceptually, I figure I can capture the necessary information from one document in DB_A (including the doc_id) in a read and then set that document in DB_B with the matching doc_id.
In a working example, perhaps one page has a lot of content (documents) that I need to generate and I don't want those writes to compete with writes used in other services on my app. When a user visits this page, we show those documents from DB_A and if the user is interested in one of those documents, we can take that document that we've effectively already read, and now write it into DB_B where user-specific content lives. It seems practical enough. Are there any indexing problems / other problems that could come out of this solution that I'm not seeing?
In the example you give the databases themselves are not communicating, but your app is communicating with multiple database instances. That is indeed possible. Since you can only have one Firestore instance per project, you will need to add multiple projects to your app.
What you're describing is known as sharding, as each database becomes a shard of (a subset of) your entire data set.
Note that it is quite uncommon to shard Firestore. If you predict such a high volume of writes, also have a look at Firebase's Realtime Database, as that is typically better suited for use-cases with more, small writes. Firestore is more suited for use-cases that have fewer, larger writes and many more readers. While you may still need to shard (and possibly shard more to reach the same read capacity) with Realtime Database, it can have multiple database instances per project, making the process easier to manage.
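If you do shard, the app needs a deterministic rule for deciding which database instance holds a given key. The sketch below assumes hash-based routing, which is one common choice and not the only one (the question's split by feature, DB_A versus DB_B, would be a fixed lookup instead). `ShardRouter` and the project ids are hypothetical names for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical router: picks which project (and so which database) holds a key.
class ShardRouter {
    private final List<String> projectIds;

    ShardRouter(List<String> projectIds) {
        if (projectIds.isEmpty()) {
            throw new IllegalArgumentException("at least one shard is required");
        }
        this.projectIds = projectIds;
    }

    // Maps a key to a shard index in [0, shard count). floorMod keeps the
    // result non-negative even when hashCode() is negative.
    int shardIndexFor(String key) {
        return Math.floorMod(key.hashCode(), projectIds.size());
    }

    String projectFor(String key) {
        return projectIds.get(shardIndexFor(key));
    }

    public static void main(String[] args) {
        // Made-up project ids; each would host its own database instance.
        ShardRouter router = new ShardRouter(
                Arrays.asList("my-app-shard-0", "my-app-shard-1", "my-app-shard-2"));
        String docId = "doc_123";
        // The same id always routes to the same project.
        System.out.println(docId + " -> " + router.projectFor(docId));
    }
}
```

Note that with plain modulo routing, changing the number of shards remaps most keys, so the shard count is worth choosing with some headroom.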
This is my first time using Firestore and I am confused about the limit on the number of collections that I can create. Is there a limit?
I need suggestions for another thing as well. I am building an app that will require different tables in the database such as Restaurants, Clients and Reservations. In Firestore there are no tables since it is a NoSQL DB, so does a 'Collection' serve as a 'Table'? What about 'Document'?
The documentation doesn't say anything about maximum number of collections. They are essentially just containers for documents, so there is no practical limit that you should be concerned about.
A SQL table is roughly analogous to a Cloud Firestore collection. A SQL row is roughly analogous to a document. It's advisable to think of Cloud Firestore not in terms of what you know in SQL, but on its own terms.
When using Firestore and subscribing to document updates, the documentation states a limit of 1M concurrent mobile/web connections per database.
https://firebase.google.com/docs/firestore/quotas#realtime_updates
Is that a hard limit (enforced/throttled in code)? Or is it a theoretical limit (like you're safe up to 1M, then things get dicey)? Is it possible to get an uplift?
Trying to understand how to support a large user base without needing to shard the database (which is one of the advantages of Firestore). Even at 5M users, it seems you would start having problems because you'd probably hit times when >20% of those users were on your app simultaneously.
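The arithmetic behind that concern can be written out as a small sketch. The 1M figure is the limit quoted above; the class and method names are made up for this example, and the percentages are just sample inputs.

```java
// Worked arithmetic for the concurrent-connection concern; names are illustrative.
class ConnectionEstimate {
    // Limit quoted in the question: concurrent mobile/web connections per database.
    static final long MAX_CONNECTIONS_PER_DATABASE = 1_000_000L;

    // Peak simultaneous connections if `percentOnline` percent of users are active at once.
    static long peakConnections(long totalUsers, int percentOnline) {
        return totalUsers * percentOnline / 100;
    }

    // Number of databases needed to stay at or under the per-database limit (rounded up).
    static long databasesNeeded(long peakConnections) {
        return (peakConnections + MAX_CONNECTIONS_PER_DATABASE - 1) / MAX_CONNECTIONS_PER_DATABASE;
    }

    public static void main(String[] args) {
        long atTwenty = peakConnections(5_000_000L, 20);
        System.out.println(atTwenty + " -> " + databasesNeeded(atTwenty));         // 1000000 -> 1
        long atTwentyFive = peakConnections(5_000_000L, 25);
        System.out.println(atTwentyFive + " -> " + databasesNeeded(atTwentyFive)); // 1250000 -> 2
    }
}
```

At exactly 20% of 5M users the peak sits right at the limit; anything above that would need a second database under this simple model.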
As you already noticed, the maximum size of a single document in Firestore is 1 MiB. Trying to store a large number of objects (maps) that may exceed this limit is generally considered bad design.
You should reconsider the logic of your app and think about why you need more than 1 MiB in a single document, rather than each object being its own document. To be able to use Firestore, you should change the way you hold the data from a single document to a collection. In the case of collections, there are no limitations. You can add as many documents as you want. According to the official documentation regarding the Cloud Firestore data model:
Cloud Firestore is optimized for storing large collections of small documents.
IMHO, you should take advantage of this feature.
For details, I recommend you see my answer from this post where I have explained some practices regarding storing data in arrays (documents), maps or collections.
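As a rough sketch of that restructuring, the helper below takes a list of objects that would otherwise be embedded in one document and assigns each its own document path under a collection. `DocumentSplitter`, the `users/alice/items` path, and the index-based ids are all hypothetical; it only builds the mapping in memory and makes no Firestore calls.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: plans one document per object instead of one big document.
class DocumentSplitter {
    // Returns document path -> document data, one entry per item.
    // Ids are derived from the list position purely for illustration.
    static Map<String, Map<String, Object>> toDocuments(
            String collectionPath, List<Map<String, Object>> items) {
        Map<String, Map<String, Object>> documents = new LinkedHashMap<>();
        for (int i = 0; i < items.size(); i++) {
            documents.put(collectionPath + "/item" + i, items.get(i));
        }
        return documents;
    }

    public static void main(String[] args) {
        Map<String, Object> first = new HashMap<>();
        first.put("name", "first");
        Map<String, Object> second = new HashMap<>();
        second.put("name", "second");

        Map<String, Map<String, Object>> docs =
                toDocuments("users/alice/items", Arrays.asList(first, second));
        // Each item now has its own path, so no single document grows with the list.
        docs.forEach((path, data) -> System.out.println(path + " = " + data));
    }
}
```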
Edit:
Without sharding, I'm afraid it is not an option. In this case, sharding will work for sure, so in my opinion it is certainly a reasonable option.
Edit: After posting the question I thought I could also make this post a quick reference for those of you who need a quick peek at some of the differences between these two technologies, which might help you decide on one of them eventually. I will be editing this question and adding more info as I learn more.
I have decided to use Firebase for the backend of my project. For Firestore it says "the next generation of the realtime database". Now I am trying to decide which way to go: Realtime Database or Cloud Firestore?
Billing:
At first glance, it looks like Firestore charges per number of results returned, number of reads, number of writes/updates, etc. The real-time database charges based on the data transmitted; the number of read-write operations is irrelevant. Both also charge for the data stored on Google's servers (I think in this respect Firestore is the cheaper one). Why am I mentioning this price point? Because from my point of view, although it might carry less weight, it is also a point to consider when choosing one over the other.
Scaling:
Cloud Firestore seems to scale horizontally seamlessly. I think this is not possible with the real-time database.
Edit:
In the real-time database, you need to shard your data yourself using multiple databases. And you can only do this if you are on the Blaze pricing plan.
ref: https://firebase.google.com/docs/database/usage/sharding
Performance & Indexing:
Another thing is that the data structure differs between the two. The real-time database stores the data as a JSON object in whatever way we structure it. Firestore structures the data as collections and documents. Hence the querying also changes between the two.
I think Firestore does auto indexing, which increases read performance greatly (and which will decrease write performance). I am not sure if this is also the case with the real-time database.
Edit:
The real-time database does not automatically index your data. You need to do it yourself after a solid inspection of your data and your needs.
ref: https://firebase.google.com/docs/database/security/indexing-data
What other differences can you think of?
What would be (or has been) your choice for different types of projects?
Do you still go with the real-time database or have you migrated from that to the firestore? If so why?
And one last thing. How would you compare the SDKs of these two?
Thanks a lot!
What other differences can you think of?
Here is what I think. I have about 6 months of experience with the Realtime Database, and the difference is that Firestore makes sorting data easy. For example, I want to retrieve user names ordered by timestamp:
Query firstQuery = firestore.collection("Names").orderBy("timestamp", Query.Direction.DESCENDING).limit(10); // load 10 names
What would be (or has been) your choice for different types of projects?
For me, the Realtime Database is for data streaming: when I work with Arduino, I want to store drone speed.
And Firestore is for smart-office use cases, like air conditioners or room lights, and enterprise use cases like inventory quantities, etc.
Do you still go with the real-time database or have you migrated from that to the firestore? If so why?
I still go with the Realtime Database because I need a tree for displaying the streaming data structure, instead of querying tables as in Firestore.
I have read in the documentation that the amount of time for retrieving data will be the same for querying a collection of 6 documents and a collection of 60M.
So is it safe to save all of the data of a specific kind (like users) under the same collection? Will I never have to split them into separate collections for getting better performance?
It is definitely possible to have slow-performing queries on Firestore, but the performance will not be related to the number of documents in the collection that you're querying. A common cause of slow reads is for example having documents that contain way more data than the application needs, which means that it takes more time to download that data to the client than is necessary for the use-case.
In your example: it is indeed normal to store all user profiles in a single collection. Querying 6 users out of that collection will always take the same amount of time, even if your app grows to millions or hundreds of millions of users.