Will document IDs like {year}-Q{n} cause performance problems? (firebase)

The Firestore docs "Best Practices" section says:
Document IDs
Do not use monotonically increasing document IDs such as:
Customer1, Customer2, Customer3, ...
Product 1, Product 2, Product 3, ...
I'd like to store historical data in a collection with one document per quarter, and use document IDs like 2022-Q1, 2022-Q2, etc.
These aren't strictly monotonically increasing because the year number changes after 4 quarters. But they're still lexicographically close. Will this cause performance problems?
My data goes back about 50 years so there will be around 200 documents.

200 documents with any IDs are not going to cause any performance problems. That is a very small amount of data from Firestore's point of view.
The recommendation that you're citing from the docs has to do with situations where documents are added to a collection very rapidly. If you read on, it says "Such sequential IDs can lead to hotspots that impact latency". If you read about hotspotting, you'll see that it is not your situation, so you don't have to worry about it.
As an aside, it's generally not a good idea to encode data in your document IDs. That can cause problems for you later; data belongs in document fields. You're generally better off using Firestore's randomly generated IDs unless you already have a source of unique IDs.
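For example, with the namespaced Firebase JS SDK you could let Firestore generate the ID and keep the quarter in fields (a minimal sketch; the collection and field names are assumptions):

const db = firebase.firestore();

// Let Firestore generate a random document ID; keep the data in fields.
await db.collection("quarterlyStats").add({
  year: 2022,
  quarter: 1,
  // ...the actual historical data for that quarter
});

// A specific quarter is still easy to address with a query:
await db.collection("quarterlyStats")
  .where("year", "==", 2022)
  .where("quarter", "==", 1)
  .get();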

Related

Indexing frequently updated counters (e.g., likes on a post and timestamps) in Firebase

I'm new to Firebase and I'm currently trying to understand how to properly index frequently updated counters.
Let's say I have a list of articles on a news website. Every article is stored in my collection 'articles', and the documents inside have a like counter, the date when the article was published, and an ID representing a certain news category. I would like to be able to retrieve the most-liked and the latest articles for every category. Therefore I'm thinking about creating two indexes: one on the category type (ASC) and likes (DESC), and one on the category type and the published date (DESC).
I tried researching limitations, and on the best practices page I found this regarding creating hotspots with indexes:
Creates new documents with a monotonically increasing field, like a timestamp, at a very high rate.
In my example, articles are not created too frequently, so I'm pretty sure this wouldn't create an issue; correct me if I'm wrong, please. But I do still wonder if I could run into limitations or high costs with my approach (especially with regard to likes, which can change frequently, while the timestamp is constant).
Is my approach to indexing likes and timestamps by category a sound approach, or am I overlooking something?
If you are not adding documents at a high rate, then you will not trigger the limit that you cited in your question.
From the documentation:
Maximum write rate to a collection in which documents contain sequential values in an indexed field: 500 per second
If you are changing a single document frequently, then you will possibly trigger the limitation that a single document can't be updated more than once per second (this applies to sustained bursts of updates only; it is not a hard limit).
From the documentation on distributed counters:
In Cloud Firestore, you can only update a single document about once per second, which might be too low for some high-traffic applications.
That limit seems to (now) be missing from the formal documentation; I'm not sure why that is, but I'm told that particular rate limit has been dropped. You might want to start a discussion on firebase-talk to get an official answer from Google staff.
Whether or not your approach is "sound" depends entirely on your expected traffic. We can't predict that for you, but you are at least aware of when things will go poorly.
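If the like counter ever does become hot, a minimal sketch of the distributed-counter pattern from that documentation, using the namespaced JS SDK (the shard count and collection names are assumptions):

const db = firebase.firestore();
const NUM_SHARDS = 10;

// Increment: write to one random shard instead of a single hot document.
function incrementLikes(articleId) {
  const shardId = String(Math.floor(Math.random() * NUM_SHARDS));
  return db.collection("articles").doc(articleId)
    .collection("shards").doc(shardId)
    .set({ likes: firebase.firestore.FieldValue.increment(1) }, { merge: true });
}

// Read: sum all shards to get the total.
async function getLikes(articleId) {
  const snap = await db.collection("articles").doc(articleId)
    .collection("shards").get();
  return snap.docs.reduce((sum, d) => sum + (d.data().likes || 0), 0);
}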

How can I know if indexing a timestamp field on a collection of documents is going to cause problems?

I saw in the Firestore documentation that it is a bad idea to index monotonically increasing values, and that doing so will increase latency. In my app I want to query posts based on Unix time, which is a double, and that number will increase as time moves on. In my case it is not perfectly monotonic, because people will not be posting every second, and I don't think my app will exceed 4 million users. Does anyone with expertise think this will be a problem for me?
It should be no problem. Just make sure to store it as a number and not as a string. Otherwise the sorting would not work as expected.
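A minimal sketch with the namespaced Firebase JS SDK (collection and field names are assumptions):

const db = firebase.firestore();

// Store the time as a number (Unix epoch milliseconds), not a string.
await db.collection("posts").add({
  title: "Hello",
  postedAt: Date.now()
});

// Numeric ordering works as expected; as strings, "9" would sort after "10".
const latest = await db.collection("posts")
  .orderBy("postedAt", "desc")
  .limit(20)
  .get();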
This is exactly the problem that the Firestore documentation is warning you about. Your database code will incur a cost of "hotspotting" on the index for the timestamp at scale. Specifically, from that linked documentation:
Creates new documents with a monotonically increasing field, like a timestamp, at a very high rate.
The numbers don't have to be purely monotonic. The hotspotting happens on ranges that are used for sharding the index. The documentation just doesn't tell you what to expect for those ranges, as they can change over time as the index gains more documents.
Also from the documentation:
If you index a field that increases or decreases sequentially between documents in a collection, like a timestamp, then the maximum write rate to the collection is 500 writes per second. If you don't query based on the field with sequential values, you can exempt the field from indexing to bypass this limit.
In an IoT use case with a high write rate, for example, a collection containing documents with a timestamp field might approach the 500 writes per second limit.
If you don't have a situation where new documents are being added rapidly, it's not a near-term problem. But you should be aware that writes simply don't scale the way reads and queries against that index will. Note that the number of concurrent users is not the issue at all - it's the number of documents being added per second to an index shard, regardless of how many people are causing the behavior.

Queryable unbound amount of items

I've been thinking a lot about possible strategies for querying an unbounded number of items.
For example, think of a forum - you could have any number of forum posts categorized by topic. You need to support at least 2 access patterns: post details view and list of posts by topic.
// legend
PK = partition key, SK = sort key

Option 1: PK = postId
While it's easy to get a single post, you can't effectively query a list of posts without a scan.

Option 2: PK = topic and SK = postId#addedDateTime
Great for querying all the posts for a given topic, but they are all in the same partition ("hot partition").

Option 3: PK = topic#date and SK = postId#addedDateTime
Store items in buckets, e.g. a new bucket for each day. This would push a lot of logic to the application layer and add latency. E.g. if you need to get 10 posts, you'd have to query today's bucket and, if the bucket contains fewer than 10 items, query yesterday's bucket, and so on. Don't even get me started on pagination; that would probably be a nightmare if it crosses buckets.
So my question is: how do you store and query an unbounded list of items the "DynamoDB way"?
I think you've got a good understanding about your options.
I can't profess to know the One True Way™ to solve this particular problem in DynamoDB, but I'll throw out a few thoughts for the sake of discussion.
While it's easy to get a single post, you can't effectively query a list of posts without a scan.
This would definitely be the case if your Primary Key consists solely of the postId (I'll use POST#<postId> to make it easier to read). That table would hold one item per post, keyed by POST#<postId>.
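A hypothetical sketch of such an item (attribute names are illustrative):

{
  PK: "POST#1",                        // primary key: POST#<postId>
  title: "First post",
  topic: "dynamodb",
  posted_at: "2020-09-17T08:30:00Z",
  body: "..."
}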
This would be super efficient for the "fetch post details view" (aka fetch post by ID) access pattern. However, we haven't built in any way to access a group of Posts by topic. Let's give that a shot next.
There are a few ways to model the one-to-many relationship between Posts and topics. The first thing that comes to mind is creating a secondary index with the topic field as its partition key.
Now we can get an item collection of Posts by topic using the efficient query operation. Pagination will help you if your number of Posts per topic grows larger. This may be enough for your application. For the sake of this discussion, let's assume it creates a hot partition and consider what strategies we can introduce to reduce the problem.
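For reference, that query operation might look like this with the AWS SDK for JavaScript v2 (a sketch; the table and index names are assumptions):

const AWS = require("aws-sdk");
const docClient = new AWS.DynamoDB.DocumentClient();

// Query the GSI for all posts in a topic.
const result = await docClient.query({
  TableName: "Posts",                    // assumed table name
  IndexName: "topic-index",              // assumed GSI name (PK = topic)
  KeyConditionExpression: "topic = :t",
  ExpressionAttributeValues: { ":t": "dynamodb" },
  Limit: 10                              // paginate with LastEvaluatedKey
}).promise();
console.log(result.Items);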
One Option
You said
Store items in buckets, e.g. a new bucket for each day.
This is a great idea! Let's update our secondary index partition key to be <topic>#<truncated_timestamp> so we can group posts by topic for a given time frame (day/week/month/etc).
I've done a few things here:
Introduced two new attributes to represent the secondary index PK and SK (GSIPK and GSISK respectively).
Introduced a truncated timestamp into the partition key to represent a given month. For example, POST#1 and POST#2 both have a posted_at timestamp in September. I truncated both of those timestamps to 2020-09-01 to represent the entire month of September (or whatever time boundary makes sense for your application), as sketched below.
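A minimal sketch of that key construction (the month boundary and names are assumptions):

function gsiPartitionKey(topic, postedAt) {
  // Truncate the timestamp to the first day of its month, e.g. 2020-09-01.
  const d = new Date(postedAt);
  const month = String(d.getUTCMonth() + 1).padStart(2, "0");
  return `${topic}#${d.getUTCFullYear()}-${month}-01`;
}

// gsiPartitionKey("dynamodb", "2020-09-17T08:30:00Z") -> "dynamodb#2020-09-01"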
This will help distribute your data across partitions, reducing the hot-key issue. As you correctly note, this will increase the complexity of your application logic and increase latency, since you may need to make multiple requests to retrieve enough results for your application's needs. However, this might be a reasonable trade-off in this situation. If the increased latency is a problem, you could pre-populate a partition to contain the results of the prior N months' worth of a topic discussion (e.g. PK = TOPIC_CACHE#<topic> with a list attribute that contains the postIds from the prior N months).
If the TOPIC_CACHE ends up being a hot partition, you could always shard the partition using a calculated suffix:
Your application could randomly select a TOPIC_CACHE between 1..N when retrieving the topic cache.
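For example (a sketch; the shard count N is an assumption you would size to your traffic):

const N = 10; // number of TOPIC_CACHE shards
const shard = 1 + Math.floor(Math.random() * N);
const pk = `TOPIC_CACHE#dynamodb#${shard}`; // e.g. TOPIC_CACHE#dynamodb#4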
There are numerous ways to approach this access pattern, and these options represent only a few possibilities. If it were my application, I would start by creating a secondary index using the Post topic as the partition key. It's the easiest to implement and would give me an opportunity to see how my application access patterns performed in a production environment. If the hot key issue started to become a problem, I'd dive deeper into some sort of caching solution.

Best data structure design in Firestore between two?

I have a doubt about how to structure my Firestore data.
Daily, I have information like:
TodayDate, From, ToUser1, Subject, Attachment2, AttachmentTypeB
TodayDate, From, ToUser1, Subject, Attachment3, AttachmentTypeA
TodayDate, From, ToUser2, Subject, Attachment4, AttachmentTypeA
TodayDate, From, ToUser2, Subject, Attachment5, AttachmentTypeC
Subject and From are never the same.
I am hesitating between two structures, but I am open to considering other structure designs.
0/ Root / doc / sub col / sub col fields
1/ users / userid / date / from,subject,etc
OR
2/ reports / date / userid / from,subject,etc
I believe solution 2 will be more cost-saving in the long run, since for one query I will have more records per date than records per user. For updates, it is similar.
What are your advice, please?
Kind regards,
Julie
Given your current data structure, I suggest you simply use Cloud Firestore instead of the Realtime Database, as it scales better and you get quite good performance for very low cost.
You could start a collection with each of the records containing your listed attributes: TodayDate, From, ToUser2, Subject, Attachment5, AttachmentTypeC. And it's easy to query using where:
firestore().collection("myCollection").where("subject", "==", subject).get()
See this comparison.
UPDATE: Regarding your two options, I don't think it comes down to which option fetches/updates more or fewer records. It comes down to your app's requirements and actual usage. You might need to fetch the records for a specific user and not just for a specific date, and vice versa. So both structures don't really make any difference in terms of cost, unless you're sure you'll never need to fetch records per user.
Hence, I think the main focus should be on how intuitive and flexible your structure is and how easy it is to maintain over time. You should consider not using sub-collections in the first place, as it appears (from your daily record data) you could achieve what you need, and get a more flexible structure, with a simple collection containing documents with the necessary properties. I think sub-collections are generally needed when you don't want to always fetch all properties of a record, or when you want real-time listeners for specific properties and not the entire record. Sub-collections don't really increase or reduce the number of records fetched; that depends on actual usage.
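To illustrate the flat structure (a sketch; the collection and field names are assumptions), each daily record becomes one document that can be queried by either user or date:

const db = firebase.firestore();

// One document per daily record, in a single flat collection.
await db.collection("reports").add({
  date: "2021-05-01",          // TodayDate
  from: "sender@example.com",  // From
  toUser: "user1",             // ToUser
  subject: "...",              // Subject
  attachment: "attachment3",   // Attachment
  attachmentType: "A"          // AttachmentType
});

// Fetch by user, by date, or both, without restructuring:
await db.collection("reports").where("toUser", "==", "user1").get();
await db.collection("reports").where("date", "==", "2021-05-01").get();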

Theory question: what strategy is faster? Querying a lot of documents vs query fewer documents and then load some?

I'm wondering what's the better structure for my Firestore database.
I want to create some sort of appointment manager for employees, where I can show every employee their appointments for a given date. I have thought of these two options:
Option 1:
Every employee has a collection Appointments where I save all the upcoming appointments. The appointment documents would have a column date.
When I want to load all appointments for a date I would have to query all appointments by this date.
Option 2:
Every employee has a collection Workdays with documents for each day. These workday documents would have the column date. And then a collection with Appointments where I save the appointments for a workday.
When I want to load all appointments, I would have to query the Workdays collection for the correct date and then load all its Appointments.
I expect an average workday to contain 10-20 appointments, and let's say I save appointments for the next 30 days. For option 1, I would then have to query across 300-600 documents to find the 10-20 for a given date.
For option 2, I would have 30 workday documents, query them for 1 document, and then load around 10-20 appointment documents.
So with option 2 I would have to query fewer documents, but I would have to wait until the query finished and then load 10-20 further documents. With option 1, I would have to query more documents, but once that query finished I wouldn't have to load any more documents.
I'm wondering what option is the faster for my use case - any thoughts?
Documents are not necessarily tabular (columnar). Keep it simple, follow their documentation ("Choose a data structure"), and do not overthink optimizing search. Leave query optimization to the Firebase platform/team, as there are several search approaches that might be implemented depending on the type of data you are querying for. Examples (source: Wikipedia) include Dijkstra's algorithm, Kruskal's algorithm, the nearest-neighbour algorithm, and Prim's algorithm.
Again, provided you basically follow their data-structure guidelines, the optimal search approach should be baked into the Firebase/Firestore platform and may be optimized by them when possible. In short, the speed of the compute platform will amaze you. Focus on higher-level tasks relating to your particular app.
If the total number of documents read in each case is the same, the faster option will be the one that reduces the number of round trips between the client and server. So, fewer total queries would be better. The total number of documents in the collection is not going to affect the results very much.
With Firestore, performance of queries scales with the size of the result set (total number of documents read), not with the total size of the collection.
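For instance, option 1 is a single round trip (a sketch; the collection and field names are assumptions):

const db = firebase.firestore();
const employeeId = "employee_1"; // illustrative

// Option 1: one query filters the employee's appointments by date.
const snapshot = await db
  .collection("employees").doc(employeeId)
  .collection("appointments")
  .where("date", "==", "2023-01-15")
  .get();
// Only the 10-20 matching documents are read, no matter how many
// appointments the collection contains in total.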
The first option is rather straightforward, and is definitely the way you'd do it with a relational database. The date column could become a natural foreign key to any potential workday table, assuming one is even needed.
The second option is more complicated because there are three data cases:
Workday does not exist
Workday does exist but has no appointments in the list
Workday exists and has appointments
In terms of performance, they are not likely to be very different, but if there is a significant gap, I'd gamble that option 1 is more efficient.
