Best data structure design in Firestore between two? - firebase

I'm unsure how to structure my Firestore data.
Daily, I have information like:
TodayDate, From, ToUser1, Subject, Attachment2, AttachmentTypeB
TodayDate, From, ToUser1, Subject, Attachment3, AttachmentTypeA
TodayDate, From, ToUser2, Subject, Attachment4, AttachmentTypeA
TodayDate, From, ToUser2, Subject, Attachment5, AttachmentTypeC
Subject and From are never the same.
I am hesitating between two structures, but I am open to considering other designs.
0/ Root / doc / sub col / sub col fields
1/ users / userid / date / from,subject,etc
OR
2/ reports / date / userid / from,subject,etc
I believe solution 2 will be more cost saving in the long run since for one query, I will have more records per date than records per user. For the update, it is similar.
What would you advise, please?
Kind regards,
Julie

Given your current data structure, I suggest you simply use Cloud Firestore instead of the Realtime Database, as it scales better and you get quite good performance for very low cost.
You could start a collection, with each of the records containing your listed attributes: TodayDate, From, ToUser2, Subject, Attachment5, AttachmentTypeC. And it's easy to query using where:
firestore().collection("myCollection").where("subject", "==", subject).get()
See this comparison.
UPDATE: Regarding your two options, I don't think it comes down to which option fetches or updates more or fewer records. It comes down to your app's requirements and actual usage. You might need to fetch the records for a specific user and not just for a specific date, and vice versa. So neither structure really makes a difference in terms of cost, unless you're sure you'll never need to fetch records per user.
Hence, I think the main focus should be on how intuitive and flexible your structure is and how easy it is to maintain over time. You should consider not using sub-collections in the first place, as it appears (from your daily record data) that you could achieve what you need, and get a more flexible structure, with a simple collection containing documents with the necessary properties. I think sub-collections are generally needed when you don't want to always fetch all properties of a record, or when you want real-time listeners for specific properties and not the entire record. Sub-collections don't really increase or reduce the number of records fetched; that depends on actual usage.

Related

Indexing frequently updated counters (e.g., likes on a post and timestamps) in Firebase

I'm new to firebase and I'm currently trying to understand how to properly index frequently updating counters.
Let's say I have a list of articles on a news website. Every article is stored in my collection 'articles' and the documents inside have a like counter, a date when it was published and an id to represent a certain news category. I would like to be able to retrieve the most liked and latest articles for every category. Therefore I'm thinking about creating two indices, one for category type (in ASC order) and likes (DESC order) and one of the category type and the published date (DESC order).
I tried researching limitations and on the best practices page I found this, regarding creating hotspots with indices:
Creates new documents with a monotonically increasing field, like a timestamp, at a very high rate.
In my example I'm using articles, which are not created too frequently. So I'm pretty sure this wouldn't create an issue; correct me if I'm wrong, please. But I do still wonder if I could run into limitations or high costs with my approach (especially regarding likes, which can change frequently, while the timestamp is constant).
Is my approach to indexing likes and timestamps by category sound, or am I overlooking something?
If you are not adding documents at a high rate, then you will not trigger the limit that you cited in your question.
From the documentation:
Maximum write rate to a collection in which documents contain sequential values in an indexed field: 500 per second
If you are changing a single document frequently, then you may trigger the limitation that a single document can't be updated more than once per second (this applies to a sustained burst of updates only and is not a hard limit).
From the documentation on distributed counters:
In Cloud Firestore, you can only update a single document about once per second, which might be too low for some high-traffic applications.
That limit seems to (now) be missing from the formal documentation, not sure why that is. But I'm told that particular rate limit has been dropped. You might want to start a discussion on firebase-talk to get an official answer from Google staff.
Whether or not your approach is "sound" depends entirely on your expected traffic. We can't predict that for you, but you are at least aware of when things will go poorly.
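If the like counter does turn out to be too hot, the distributed-counter idea quoted above can be sketched roughly as follows. This is only an in-memory illustration, not Firestore code: `createShardedCounter`, `numShards` and `pickShard` are hypothetical names, and in Firestore each slot would be a separate shard document.

```javascript
// In-memory sketch of a sharded counter (all names are hypothetical).
// In Firestore, each slot would be its own document, so increments are
// spread across several documents instead of hitting a single one.
function createShardedCounter(
  numShards,
  pickShard = () => Math.floor(Math.random() * numShards)
) {
  const shards = new Array(numShards).fill(0);
  return {
    // Add to one randomly chosen shard.
    increment(by = 1) {
      shards[pickShard()] += by;
    },
    // The logical counter value is the sum over all shards.
    total() {
      return shards.reduce((sum, n) => sum + n, 0);
    },
    shards,
  };
}

const likes = createShardedCounter(5);
for (let i = 0; i < 100; i++) likes.increment();
console.log(likes.total()); // 100
```

The trade-off is that reading the total means reading every shard, so the shard count should stay as small as the expected write rate allows.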

Best way to store user-specific data in Firestore

I have an app that helps store owners manage their inventory through a simple API-driven interface.
My app stores all data on Firestore. My simplified database looks like this:
-users
  -name
  -email
  -uid
-products
  -atts
  ...
  -ownerId
-someOtherThing
  -atts
  ...
  -ownerId
The idea is that only documents with ownerId that matches the current user ID will be accessible to the user. User with ID=5 will only have access to items that match ownerId=5.
Is this a good way of storing this data? I am worried that I will eventually end up with thousands of documents in that collection and querying them by "ownerId" might not be the best way to tackle this. On the other hand, I might end up with hundreds of users too, which probably makes it bad design to introduce several new collections for each of them?
What would be a better approach design-wise?
While "a good way" is subjective and purely dependent on the use-cases of your app, what you're proposing is quite a common way to store data in Firestore.
Your concern about the number of users and other documents is unwarranted, as Firestore guarantees that the performance of returning the (say) products for a specific user depends solely on the number of products returned, not on the total number of products in the database.
So if you have 10 products that you're the ownerId for, then no matter how many other users/products there are, the amount of time it takes to retrieve your 10 products will always be the same.

Firestore evergrowing collection

I'm working on an app where users create certain events in a calendar.
I was thinking on structuring the calendar events data as follows:
allEventsEver/{yearId}/months/{monthId}/events/{eventId}
I understand that
Firestore is optimized for storing large collections of small documents
but the structure above would mean that this would be an ever-growing collection. Is this something I should worry about? Would it be better to create a new collection for each year, e.g.:
2022/months/{monthId}/events/{eventId}
2023/months/{monthId}/events/{eventId}
Also, should I avoid using year/month value as document id (e.g. 2022) as those would be considered sequential ids that could cause hotspots that impact latency? If yes, what other approach do you suggest?
The most important/unique performance guarantee Firestore gives is that its query performance is independent of the number of documents in the collection. Query performance only depends on how much data you return, not on how much data needs to be considered.
So an ever-growing collection is not a concern on Firestore. As long as you put a limit on how many results your query can return, you'll have an upper bound on how much time it will take.

Firebase / NoSQL - How to aggregate data for statistics

I'm creating my first ever project with Firebase, and I've come to the point where I need some statistics based on user input. I know Firebase (or NoSQL databases in general) is not ideal for statistics, but it works for me in all other cases, so I would like to give it a try.
What I have:
I'm working on an application where people can invite a friend to work for their company. I have a collection of "referrals" where the ID of each referral is the UserID of the user to whom the referral belongs, and each referral has a subcollection named "items" where the data is stored.
What my data looks like:
Each item has this data:
applicant
appliedDate
position (includes the positionId and the department the position belongs to)
status
What I want is to let users generate statistics based on:
date range
status
department
What I was thinking about:
It's probably not the best idea to let Firebase iterate over all referrals every time a user makes a request, as that may get really expensive. What I was thinking of is using Cloud Functions to recalculate statistics whenever something changes, e.g. when a new applicant applies I would increase the total counter by one and do the same for the counter of the specific department. However, I feel like this may work for total numbers or for predefined queries, e.g. "LAST MONTH", but since I won't know what dates the user will select, it starts to get tricky.
Any idea how can I design something like this?
Thanks a lot!
What you're considering is the idiomatic approach to calculating aggregates in Firestore, and in most NoSQL databases. If you follow this pattern, Firestore is quite well suited to storing statistics.
It's ad-hoc statistics, like the unknown date range, that are trickier. Usually this comes down to storing the right values so that you no longer need to read an unknown number of documents to calculate a value.
For example, if you store counters for the statistics per month, week, day and hour, you can satisfy a wide range of date ranges with a limited number of read operations. You may need to read multiple documents, but the number of documents to read depends on the range, and not on the total number of documents in the database.
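As a rough sketch of that idea, assuming counters are stored per month and per day under keys such as `2023-02` and `2023-01-30` (the key format and the function name `counterKeysForRange` are illustrative assumptions, not something Firestore prescribes), a date range can be broken down into the smallest set of counter documents to read:

```javascript
// Decompose an inclusive date range (YYYY-MM-DD strings, treated as UTC)
// into month keys for fully covered months and day keys for the rest.
// Key formats are assumptions made for this sketch.
function counterKeysForRange(start, end) {
  const keys = [];
  const last = Date.parse(end + "T00:00:00Z");
  let cur = new Date(start + "T00:00:00Z");
  while (cur.getTime() <= last) {
    const y = cur.getUTCFullYear();
    const m = cur.getUTCMonth();
    const monthEnd = Date.UTC(y, m + 1, 0); // last day of the current month
    if (cur.getUTCDate() === 1 && monthEnd <= last) {
      // The whole month lies inside the range: read one monthly counter.
      keys.push(cur.toISOString().slice(0, 7));
      cur = new Date(Date.UTC(y, m + 1, 1));
    } else {
      // Partial month: fall back to a daily counter.
      keys.push(cur.toISOString().slice(0, 10));
      cur = new Date(cur.getTime() + 24 * 60 * 60 * 1000);
    }
  }
  return keys;
}

console.log(counterKeysForRange("2023-01-30", "2023-03-02"));
// [ '2023-01-30', '2023-01-31', '2023-02', '2023-03-01', '2023-03-02' ]
```

Here a range of 32 days is answered with five reads. Adding weekly or hourly counters, as mentioned above, would follow the same pattern with more levels.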
Of course, for the most flexible ad-hoc querying, you may still want to consider another solution, such as BigQuery, which was made precisely for this use-case.

Queryable unbound amount of items

I've been thinking a lot about possible strategies for querying an unbounded number of items.
For example, think of a forum - you could have any number of forum posts categorized by topic. You need to support at least 2 access patterns: post details view and list of posts by topic.
// legend
PK = partition key, SK = sort key
While it's easy to get a single post, you can't effectively query a list of posts without a scan.
PK = postId
Great for querying all the posts for given topic but all are in same partition ("hot partition").
PK = topic and SK = postId#addedDateTime
Store items in buckets, e.g new bucket for each day. This would push a lot of logic to the application layer and add latency. E.g., if you need to get 10 posts, you'd have to query today's bucket and, if that bucket contains fewer than 10 items, query yesterday's bucket, etc. Don't even get me started on pagination. That would probably be a nightmare if it crosses buckets.
PK = topic#date and SK = postId#addedDateTime
So my question is: how do I store and query an unbounded list of items the "DynamoDB way"?
I think you've got a good understanding about your options.
I can't profess to know the One True Way™ to solve this particular problem in DynamoDB, but I'll throw out a few thoughts for the sake of discussion.
While it's easy to get a single post, you can't effectively query a list of posts without a scan.
This would definitely be the case if your Primary Key consists solely of the postId (I'll use POST#<postId> to make it easier to read). That table would look something like this:
This would be super efficient for the "fetch post details view" (aka fetch post by ID) access pattern. However, we haven't built in any way to access a group of Posts by topic. Let's give that a shot next.
There are a few ways to model the one-to-many relationship between Posts and topics. The first thing that comes to mind is creating a secondary index on the topic field. Logically, that would look like this:
Now we can get an item collection of Posts by topic using the efficient query operation. Pagination will help you if your number of Posts per topic grows larger. This may be enough for your application. For the sake of this discussion, let's assume it creates a hot partition and consider what strategies we can introduce to reduce the problem.
One Option
You said
Store items in buckets, e.g new bucket for each day.
This is a great idea! Let's update our secondary index partition key to be <topic>#<truncated_timestamp> so we can group posts by topic for a given time frame (day/week/month/etc).
I've done a few things here:
Introduced two new attributes to represent the secondary index PK and SK (GSIPK and GSISK respectively).
Introduced a truncated timestamp into the partition key to represent a given month. For example, POST#1 and POST#2 both have a posted_at timestamp in September. I truncated both of those timestamps to 2020-09-01 to represent the entire month of September (or whatever time boundary that makes sense for your application).
This will help distribute your data across partitions, reducing the hot key issue. As you correctly note, this will increase the complexity of your application logic and increase latency, since you may need to make multiple requests to retrieve enough results for your application's needs. However, this might be a reasonable trade-off in this situation. If the increased latency is a problem, you could pre-populate a partition to contain the results of the prior N months' worth of a topic discussion (e.g. PK = TOPIC_CACHE#<topic> with a list attribute that contains a list of postIds from the prior N months).
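To make the extra application logic concrete, here is a minimal sketch of building the `<topic>#<truncated_timestamp>` key and walking backwards through monthly buckets until enough posts are found. It is synchronous and in-memory for brevity: `bucketKey`, `latestPosts` and `queryBucket` are hypothetical names, `queryBucket` stands in for a query against the secondary index, and `POST#0` is made-up sample data.

```javascript
// Build the bucketed index key: <topic>#<timestamp truncated to the month>.
function bucketKey(topic, date) {
  const y = date.getUTCFullYear();
  const m = String(date.getUTCMonth() + 1).padStart(2, "0");
  return `${topic}#${y}-${m}-01`;
}

// Walk backwards one month at a time until `limit` posts are collected
// or `maxBuckets` buckets have been checked.
// `queryBucket(key)` is a stand-in for a query on the secondary index and
// is assumed to return that bucket's posts, newest first.
function latestPosts(topic, now, limit, maxBuckets, queryBucket) {
  const posts = [];
  const y = now.getUTCFullYear();
  const m = now.getUTCMonth();
  for (let i = 0; i < maxBuckets && posts.length < limit; i++) {
    // Date.UTC normalises negative months, so year boundaries just work.
    const key = bucketKey(topic, new Date(Date.UTC(y, m - i, 1)));
    posts.push(...queryBucket(key).slice(0, limit - posts.length));
  }
  return posts;
}

// Sample data standing in for the index.
const index = new Map([
  ["news#2020-09-01", ["POST#2", "POST#1"]],
  ["news#2020-08-01", ["POST#0"]],
]);
const result = latestPosts(
  "news",
  new Date("2020-09-15T00:00:00Z"),
  3,
  6,
  (key) => index.get(key) ?? []
);
console.log(result); // [ 'POST#2', 'POST#1', 'POST#0' ]
```

The `maxBuckets` cap is a design choice for the sketch: without it, a quiet topic could trigger an unbounded number of empty bucket queries.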
If the TOPIC_CACHE ends up being a hot partition, you could always shard the partition using a calculated suffix:
Your application could randomly select a TOPIC_CACHE between 1..N when retrieving the topic cache.
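A minimal sketch of that random selection, assuming the sharded keys look like `TOPIC_CACHE#<topic>#<n>` (this exact key layout and the name `pickTopicCacheKey` are assumptions for illustration):

```javascript
// Pick one of N cache shards at random for a read.
// The key layout TOPIC_CACHE#<topic>#<n> is an assumption for this sketch.
// `random` is injectable so the choice can be made deterministic in tests.
function pickTopicCacheKey(topic, shardCount, random = Math.random) {
  const shard = 1 + Math.floor(random() * shardCount); // 1..shardCount
  return `TOPIC_CACHE#${topic}#${shard}`;
}

console.log(pickTopicCacheKey("news", 4, () => 0)); // TOPIC_CACHE#news#1
```

Reading from a random shard only works if every shard holds a copy of the cache, so whatever process populates the cache would need to write to all N keys.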
There are numerous ways to approach this access pattern, and these options represent only a few possibilities. If it were my application, I would start by creating a secondary index using the Post topic as the partition key. It's the easiest to implement and would give me an opportunity to see how my application access patterns performed in a production environment. If the hot key issue started to become a problem, I'd dive deeper into some sort of caching solution.

Resources