Imagine you have a “Posts” model in Firestore which has image, description, rating, comments, etc. You want to display 10 or 15 comments when a post is clicked by the user. The question is:
Would you store comments in the “Posts” model as a field, or would you create a new “Comments” data model for that?
In the first case, I wonder how to handle a post that has 1,000,000 comments. As far as I know you can't paginate a field: each time you would have to fetch all of the comments, which is a heavy and wasteful request. What is the best way to store comments?
Would you store comments in the “Posts” model as a field, or would you create a new “Comments” data model for that? In the first case, I wonder how to handle a post that has 1,000,000 comments?
There is no "100% correct" way of doing this, but your modeling should match the requirements of your expected use case. Without knowing how you are going to query this data, you might make a bad design decision.
Note that the maximum size of a Firestore document is 1 MB. If you are expecting a large number of comments, they simply will not fit inside a single post document, and you should instead store each comment as a separate document in a subcollection.
If you need to paginate any items at all, you should always store them as separate documents. Firestore queries can't fetch partial documents - a read always gets everything in the document.
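For example, here is a minimal sketch of paginated comment reads, assuming a "comments" subcollection under each post and the namespaced web SDK used elsewhere on this page (field names like createdAt are illustrative):

// First page: the 15 most recent comments of one post.
const firstPage = await db.collection("posts").doc(postId)
  .collection("comments")
  .orderBy("createdAt", "desc")
  .limit(15)
  .get();

// Remember the last document so the next page can continue from it.
const lastVisible = firstPage.docs[firstPage.docs.length - 1];

// Next page: start after the last document of the previous page.
const nextPage = await db.collection("posts").doc(postId)
  .collection("comments")
  .orderBy("createdAt", "desc")
  .startAfter(lastVisible)
  .limit(15)
  .get();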
Actually, you have answered your own question :)
I would create another data model / table so you can more easily implement:
pagination
remove / edit
likes
answers to comments etc...
This brings more complexity, but it is a more elegant and flexible solution.
The first solution only makes sense if it can never happen that there are 1 million comments, for example in an intranet application. But better not to do it anyway, because the effort is actually the same.
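For example, adding and later editing a comment as its own document in a top-level "comments" collection (a sketch with illustrative field names, using the namespaced web SDK):

// Add a comment that references its post by ID.
const ref = await db.collection("comments").add({
  postId: postId,
  authorId: uid,
  text: "Nice photo!",
  likes: 0,
  createdAt: firebase.firestore.FieldValue.serverTimestamp(),
});

// Editing or removing later only touches that one small document.
await ref.update({ text: "Nice photo! (edited)" });
await ref.delete();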
I am writing an app where there is not a lot of interaction with other users. Set and retrieve your own data only.
In Firebase Firestore, how could I model this so that everything fits under a user's UID?
Something that would look like this?
users/{uid}/user/
users/{uid}/settings/
users/{uid}/weather/
If I want to achieve something like this, then I need to create another UID:
users/{uid}/user/{uid}/{userInfo}
This feels a bit off to me.
Is this wrong? Would it be better if I moved every subcollection into its own collection?
Is this faster / more efficient?
Any help is appreciated!
The most common approaches for me:
Store the profile information, settings, and weather in the user document (your {uid}) itself. This is most common for the profile information, but it's always worth considering for the other types too: do they really need to be in their own documents?
Have a default name for a single subcollection for each user, and then have each information type as a document with a known name in there. So /users/$uid/documents/profile, /users/$uid/documents/settings, and /users/$uid/documents/weather. So now each information type is in a separate document, meaning you can for example secure access to them individually.
If the information for a certain type is repeated, I'd put it in documents in a known/named subcollection. So if there are many weather entries, you'd get /users/$uid/weather/$weatherdocs. With this you can now have an endless set of that specific type of information.
None of these approaches is inherently better or worse; it all depends on the use cases of your app.
There will be performance differences between these approaches, as they require a different number of network requests. If this is a concern for your app, I'd recommend testing all approaches above to measure their relative performance against your requirements.
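To make the second approach concrete, here is a minimal sketch, assuming the namespaced web SDK (the subcollection name "documents" and the fields are illustrative):

// Each information type lives at a known path, so you can read,
// write, and secure each one individually.
const settingsRef = db.collection("users").doc(uid)
  .collection("documents").doc("settings");

// Merge so other settings fields are preserved.
await settingsRef.set({ theme: "dark", units: "metric" }, { merge: true });

// Reading the settings costs exactly one document read.
const settings = await settingsRef.get();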
I'm currently deciding on my Firestore data structure.
I'll need a products collection, and the product items will live inside it as documents.
Here are my product's fields:
uniqueKey: string
description: array of strings
images: array of objects
price: number
QUESTION
Should I use Firestore's auto-generated IDs as my document IDs, or is it better to use my uniqueKey (which I'll query for on many occasions) as the document ID? Is there a best option between the two?
I imagine that if I use my uniqueKey, it will make my life easier when retrieving a single document, but I'll have to query for more than one product on many occasions too.
Using my uniqueKey as ID:
db.collection("products").doc("myUniqueKey").get();
Using my Firestore auto-generated ID:
db.collection("products").where("uniqueKey", "==", "myUniqueKey").get();
Is this enough of a reason to go with my uniqueKey instead of the auto-generated one? Is there a rule of thumb here? What's the best practice in this case?
In terms of making queries from a client, using only the information you've given in the question, I don't see that there's much practical difference between a document get using its known ID, or a query on a field that is also unique. Either way, an index is used on the server side, and it costs exactly 1 document read. The document get() might be marginally faster, but it's not worthwhile to optimize like this (in my opinion).
When making decision about data modeling like this, it's more important to think about things like system behavior under load and security rules.
If you're reading and writing a lot of documents whose IDs have a sequential property, you could run into hotspotting on those writes. So, if you want to use your own ID, and you expect to be reading and writing them in that sequence under heavy load, you could have a problem. If you don't anticipate this situation, then it likely doesn't matter much which ID you use.
If you are going to use security rules to limit access to documents, and you use the contents of other documents to help with that, you'll need to be able to uniquely identify those documents in your rules. You can't perform a query against a collection in rules, so you might need meaningful IDs that give direct access when used by rules. If your own IDs can easily be used this way in security rules, that might be more convenient overall. If you're forced to use Firestore's generated IDs, it might become inconvenient, difficult, or expensive to try to maintain a relationship between your IDs and Firestore's IDs.
In any event, the decision you're making is not just about which ID is "better" in a general sense, but which ID is better for your specific, anticipated situation, under load, with security in mind.
First off, I know how Firestore works and have spent a lot of time evaluating different approaches for a good structure. Still, I am considering the following scenario:
There is a database of known recipes. Users can add recipes, but these have to be confirmed to be real recipes and not just variations. Every user can choose recipes from the user-generated list to state that they know how to cook them (or add new ones).
Now I want users to share their list of recipes with others, but this is where I am not sure how this can best be accomplished using Firestore. The trick is that I want to show all the recipes at once, and I don't want to paginate them.
I am currently evaluating two possibilities:
Subcollections
Whenever a user shares his list, the user looking at it will have to load the entire list of recipes, which can result in a high number of document reads (realistically I suppose ~50, in very rare cases maybe 1,000).
Pros:
More natural structure
Easier to maintain (e.g. deleting a recipe, checking if a specific one exists)
Easier to add fields (e.g. timeOfCreation, comment, personalRating, ...)
Cons:
Can result in a high amount of reads on the long run
Arrays
I could save every known recipe (its ID and an imageURL) inside the user's document (or in a single subdocument "KnownRecipes") within an array. This array could take the form:
recipesKnown: [{rid: 293ndwa, imageURL: image1.com, timeAdded: 8371201332},
{rid: 9012831, imageURL: image1.com, timeAdded: 8371201871},
{rid: jd812da, imageURL: image1.com, timeAdded: 8371201118},
...
]
Pros:
I only need one document read whenever someone wants to see another user's list
Reading a user's list is probably faster
Cons:
It's hard to update a specific recipe (e.g. someone wants to change the imageURL): I need to change the list locally and send the entire document as an update to the server, since I cannot change just a single element of the array
When a user ends up with around 1,000 recipes (this may never happen, but it could), the 1 MiB Firestore document limit could be reached. A possible workaround would be to create a separate document and split the array across the two documents.
For me, the idea with Subcollections seems to be the more "clean" solution to this problem, but maybe I am missing some arguments on why one of those solutions would be superior over the other.
My most common queries are as follows (ordered descending by importance):
Which recipes can a user cook
Add a recipe a user can cook to the user's list
Who can cook a specific recipe (there is a Recipe -> Cooks subcollection)
Update an existing recipe a user can cook
The answer to your question depends on the level of scalability you want to achieve.
If by design the amount of sub-data you want to store is limited and very low, you should use arrays, since you reduce the number of document reads, which means lower costs.
If your sub-data is supposed to increase "unlimitedly" over time, you should use sub-collections.
If you're building a database which is not supposed to scale in any direction (Proof of concept, very small business, etc.) just go with what you feel more comfortable with.
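As an illustration, here is a minimal sketch of the subcollection option against the four queries listed in the question, assuming the namespaced web SDK (collection and field names are illustrative):

// 1. Which recipes can a user cook
const known = await db.collection("users").doc(uid)
  .collection("knownRecipes").get();

// 2. Add a recipe to the user's list, keyed by recipe ID to avoid duplicates
await db.collection("users").doc(uid)
  .collection("knownRecipes").doc(recipeId)
  .set({ imageURL: "image1.com", timeAdded: Date.now() });

// 3. Who can cook a specific recipe (the Recipe -> Cooks subcollection)
const cooks = await db.collection("recipes").doc(recipeId)
  .collection("cooks").get();

// 4. Update a single recipe in the user's list without rewriting the rest
await db.collection("users").doc(uid)
  .collection("knownRecipes").doc(recipeId)
  .update({ imageURL: "image2.com" });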
I'm researching the same question...
One of the questions is whether the data held in the document will ever grow past 1 MB, the limit for a document. Looking into how much plain text fits in 1 MB: it's a huge amount. Still, if the data kept growing, it would eventually hit that ceiling. So if you are thinking really big, use subcollections.
If we follow Firebase's own design logic, the answer would be subcollections.
Still, I guess the major point is the data pulled. If you fetch the user document, you directly pull down that whole MB of data. With a subcollection it won't load by default, and even when you do need it, you can still lazy-load it.
I guess for the kind of setup you are doing: subcollections.
A key is an additional con/pro of the separate-collection approach:
a key could help avoid duplicates, but this requires thinking about what the definition of a duplicate is (which might change);
the array's no-key behavior could be emulated via auto-IDs.
p.s. @Thomas's list of pros/cons in the question has been quite helpful.
Some background:
My question is very similar to this clarification question about denormalization, but I want to change the situation a bit.
In the Considerations section of this blog post on denormalization, the Firebase people say the following about updating data.
Let’s discuss some consequences of a [denormalized data structure]. You will need to ensure that every time some data is created (in this case, a comment) it is put in the right places.
The example includes three paths, one to store the comment's data, and two paths under which to store pointers to that comment.
...
Modification of comments is easy: just set the value of the comment under /comments to the new content. For deletion, simply delete the comment from /comments — and whenever you come across a comment ID elsewhere in your code that doesn’t exist in /comments, you can assume it was deleted and proceed normally:
But this only works because, as the answer to the other question says,
The structure detailed in the blog post does not store duplicate comments. We store comments once under /comments then store the name of those comments under /links and /users. These function as pointers to the actual comment data.
Basically, the content is only stored in one location.
The question:
What if the situation were such that storing duplicate data is necessary? In that case, what is the recommended way to update data?
My attempt at an answer:
An answer to this question exists, but it is directed at MongoDB, and I'm not sure it quite addresses the issue in Firebase.
The most sensible way I could think of, just for reference, is as follows.
I have a helper class to which I give a catalog of paths in Firebase, which somewhat resembles a schema. This class has methods that wrap Firebase methods, so that I can perform writes and updates under all the paths specified by my schema. The helper class iterates over every path where there is a reference to the object, and at each location performs a write, update, or delete. In my case, no more than 4 paths exist for any individual operation like that, and most have 2.
Example:
Imagine I have three top-level keys, Users, Events, and Events-Metadata. Users post Images to Events, and both Events and Users have a nested record for all their respective Images. Events-Metadata is its own top-level key for the case where I want to display a bunch of events on a page, but I don't want to pull down potentially hundreds of Image records along with them.
Images can have captions, and thus, when updating an Image's caption, I should update these paths:
new Firebase("path/to/eventID/images/imageID/caption"),
and
new Firebase("path/to/userID/images/imageID/caption")
I give my helper class both of these paths and a wrapper method, so that any time a caption is updated, I can call helperclass.updateCaption(imageObj, newCaptionData) and it iteratively updates the data at each path.
Images are stored with attributes including eventID, userID, and imageID, so that the skeletons of those paths can be filled in correctly.
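Concretely, the wrapper boils down to something like this (a sketch, assuming the legacy SDK shown above; the root URL and path layout are placeholders). The two writes could even be combined into one atomic multi-location update:

var root = new Firebase("https://example.firebaseio.com");

function updateCaption(image, newCaption) {
  // One update() call with multiple paths writes all copies atomically,
  // so the duplicates can't drift apart if one write fails.
  var fanOut = {};
  fanOut["events/" + image.eventID + "/images/" + image.imageID + "/caption"] = newCaption;
  fanOut["users/" + image.userID + "/images/" + image.imageID + "/caption"] = newCaption;
  root.update(fanOut);
}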
Is this a recommended and/or appropriate way to approach this issue? Am I doing this wrong?
I'm using ASP.NET and a SQL database. I have a blog-like system where a number of comments are made against a post, and I want to display the number of those comments next to the post. To get that number I could either hold it in the post record, adding/subtracting as comments are added or deleted, or I could calculate the number of comments with a SQL query each time a user hits the page. The latter seems like a bad idea, as it is going to hit my SQL database harder; however, holding the number in the record feels like it could be error-prone. What do you think is best coding practice in this case?
Always start with a normalized database (your second option). Only denormalize if you have an absolute necessity for performance reasons. Designing it in the denormalized way (which is error-prone as you guessed) is premature optimization. With proper indexes it should be fine calculating the number on the fly.
I think the SQL statement should be fine. The other is duplication of data you already have. A count query should be quick.
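For example, the count can come straight from the comments table (a sketch, assuming hypothetical Post/Comment tables):

-- number of comments for one post
select count(*) as CommentCount
from Comment
where PostId = @PostId

-- or counts for a whole listing page in one query
select p.PostId, count(c.CommentId) as CommentCount
from Post p
left join Comment c on c.PostId = p.PostId
group by p.PostId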
Don't optimize prematurely. Use the simple solution and pagefault in optimizations only when they're needed.
I would query the database each time you want the information. I would revisit it later if you find that performance is lacking (optimize later). For the traffic most blog type applications will get, that should be sufficient.
Perhaps get the count back as part of the main thread query so as to limit the number of hits on the actual DB from the webserver. But I would always query the actual count and not try to keep it in a field; data will eventually get out of sync, as that is reality.
To increase performance, you could keep a flag in the main table to indicate if the item has any comments but only use this as a 'hint' as to whether or not to perform an additional query to count and retrieve comments at a later time.
Imagine a photo gallery that returns 50 photos to rotate through. Each photo could have its own comments.
The initial page load would return a list of photos plus a flag indicating if a photo has comments.
When a photo is displayed, if the comments flag is set to True, your app would make an ajax request to count and fetch the comments for that photo.
If only 3 out of the 50 photos have comments, you just saved yourself 47 additional requests!
This does denormalize the data, but on a limited level.
Creating hints can really help improve performance for very busy sites.
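A sketch of such a hint column, assuming hypothetical Photo/Comment tables:

-- add the hint flag
alter table Photo add HasComments bit not null default 0

-- set it when a comment is inserted
update Photo set HasComments = 1 where PhotoId = @PhotoId

-- clear it when the last comment is deleted
update Photo set HasComments = 0
where PhotoId = @PhotoId
  and not exists (select 1 from Comment where PhotoId = @PhotoId)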
Depending on how your data model looks... don't add the total post count to the main thread record; it is error-prone. You should calculate the comment count when needed, based on the thread ID, IMHO.
Caching the pages and updating that cache as comments are added/removed would be a good option, along with the SQL count query, if you are that worried about the number of queries hitting the DB.
I usually use an indexed view for this kind of thing. This allows you to denormalize the data for quick retrieval, but there is no way for it to get out of sync. Folks will also not be confused and think the view is the master of the data. I have mostly used the standard sku of SS2K5, so I have to specify the (noexpand) hint to get it to actually use the index on the view (enterprise will do it automatically). So for standard sku, I always create a wrapper view that everyone hits so I know the hint is always in place.
Coding this on the web page, so hopefully no syntax errors ;)
create view dbo.postCount__
with schemabinding -- required before an index can be created on the view
as
select
threadId
,postCount=count_big(*)
from dbo.post -- count the posts, not the threads; schemabinding requires two-part names
group by threadId
go
create unique clustered index postCount__xpk_threadid on dbo.postCount__(threadId)
go
create view dbo.postCount
as
select
threadId
,postCount=cast(postCount as int)
from dbo.postCount__ with (noexpand)
go
So I use a nomenclature on the actual indexed view to let everyone know not to query it directly. Instead they look for the associated wrapper view that enforces the noexpand hint. Using an indexed view forces you to do count_big, so I often cast down to int in the wrapper view to be able to keep our asp.net code lazily using 32 bit ints. It would be better to omit the cast, but it hasn't been of any significant impact for me.
EDIT - I can tell you that forum software always denormalizes the post count to the thread table. It kills the DB to continually count the post count on every page view if you have an active forum. I love that mssql has indexed views so you can define the denormalization declaratively rather than maintain it yourself.