I'm creating my first ever project with Firebase, and I've come to the point where I need some statistics based on user input. I know Firebase (and NoSQL databases in general) is not ideal for statistics, but it has worked for me in every other case, so I would like to give it a try.
What I have:
I'm working on an application where people can invite a friend to work for their company, so I have a collection of "referrals" where the ID of each referral is basically the UserID of the user to whom the referral belongs, and then there is a subcollection named "items" where the data is stored.
What my data looks like:
Each item has this data:
applicant
appliedDate
position (which includes the positionId and the department the position comes from)
status
What I want is to let users generate statistics based on:
date range
status
department
What I was thinking about:
It's probably not the best idea to have Firebase iterate over all referrals whenever a user makes a request, as that may get really expensive. What I was thinking of is using Cloud Functions to update the statistics whenever something changes, e.g. when a new applicant applies I would increase a counter by one, and do the same for a counter for the specific department. However, I feel like this may work for total numbers or for predefined queries such as "LAST MONTH", but since I won't know what dates the user will select, it starts to get tricky.
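Roughly, what I had in mind is something like the sketch below (just a sketch; the referralStats document and its field names are made up by me, not anything Firebase prescribes):

// Sketch only: assumes a referrals/{userId}/items/{itemId} layout and a
// per-user counters document at referralStats/{userId}.
const functions = require('firebase-functions');
const admin = require('firebase-admin');
admin.initializeApp();

exports.onReferralItemCreated = functions.firestore
  .document('referrals/{userId}/items/{itemId}')
  .onCreate((snap, context) => {
    const item = snap.data();
    const statsRef = admin.firestore().doc(`referralStats/${context.params.userId}`);

    // Bump a total counter plus per-status and per-department counters.
    return statsRef.set({
      total: admin.firestore.FieldValue.increment(1),
      byStatus: { [item.status]: admin.firestore.FieldValue.increment(1) },
      byDepartment: { [item.position.department]: admin.firestore.FieldValue.increment(1) },
    }, { merge: true });
  });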
Any idea how can I design something like this?
Thanks a lot!
What you're considering is the idiomatic approach to calculating aggregates in Firestore, and in most NoSQL databases. If you follow this pattern, Firestore is quite well suited to storing statistics.
It's the ad-hoc statistics, like your unknown date range, that are trickier. Usually this comes down to storing the right values so that you don't need to read an unknown number of documents to calculate a value.
For example, if you store counters for the statistics per month, week, day and hour, you can satisfy a wide variety of date ranges with a limited number of read operations. You may need to read multiple documents, but the number of documents to read depends on the length of the range, not on the total number of documents in the database.
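Concretely, the read side could be sketched like this (assuming one counter document per day under referralStats/{userId}/daily, each with date and count fields; those names are just an assumption):

// Sketch only: one counter document per day, summed client-side.
async function countForRange(db, userId, startDate, endDate) {
  const snapshot = await db
    .collection('referralStats').doc(userId).collection('daily')
    .where('date', '>=', startDate)
    .where('date', '<=', endDate)
    .get();

  // The number of reads scales with the number of days in the range,
  // not with the number of referral items.
  return snapshot.docs.reduce((sum, doc) => sum + (doc.data().count || 0), 0);
}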
Of course, for the most flexible ad-hoc querying, you may still want to consider another solution, such as BigQuery, which was made precisely for this use-case.
Related
Let's say I have a multi-restaurant food ordering app.
I'm storing orders in Firestore as documents.
Each order object/document contains:
total: double
deliveredByUid: str
restaurantId: str
I want to see, at any time during the day, the totals of every driver for each restaurant, like so:
robert:
  mcdonalds: 10
  kfc: 20
alex:
  mcdonalds: 35
  kfc: 10
What is the best way of calculating the totals of all the orders?
I'm currently thinking of the following:
The safest and easiest method, but expensive: each time I need to know the totals, I just query all the documents for that day and calculate them one by one.
Cloud Functions method: each time an order is added/removed, modify a value at a specific Realtime Database child: /totals/driverId/placeId
Manual totals: each time a driver completes an order and writes their ID to the order object, make another write to the specific Realtime Database child.
Edit: added the whole order object because I was asked to.
What I would most likely do is make sure orders are completely atomic (or as atomic as they can be). Most likely, I'd perform the order on the client within a transaction or batch write (both are atomic) that would not only create the order document in question but also update the delivery driver's document by incrementing their running total. Depending on how extensible I wanted to get, I might even create subcollections within the driver's document that represent chunks of time, if I wanted to be able to record totals by month or year, or whatever. You really want to think this one through now.
The reason I'd advise against your suggested pattern is because it's not atomic. If the operation succeeds on the client, there is no guarantee it will succeed in the cloud. If you make both writes part of the same transaction then they could never be out of sync and you could guarantee that the total will always be accurate.
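A rough sketch of that pattern with the web SDK's batched writes (the orders collection and the driverTotals/{driverId} document are assumptions about your schema, not requirements):

// Sketch only: creates the order and bumps the driver's per-restaurant total in one atomic batch.
import firebase from 'firebase/app';
import 'firebase/firestore';

async function placeOrder(db, order) {
  const orderRef = db.collection('orders').doc();
  const totalsRef = db.collection('driverTotals').doc(order.deliveredByUid);

  const batch = db.batch();
  batch.set(orderRef, order);
  // Either both writes happen or neither does, so the totals can't drift.
  batch.set(totalsRef, {
    [order.restaurantId]: firebase.firestore.FieldValue.increment(order.total),
  }, { merge: true });

  await batch.commit();
}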
I've been thinking a lot about possible strategies for querying an unbounded number of items.
For example, think of a forum - you could have any number of forum posts categorized by topic. You need to support at least 2 access patterns: post details view and list of posts by topic.
// legend
PK = partition key, SK = sort key
PK = postId
While it's easy to get a single post, you can't effectively query a list of posts without a scan.
PK = topic and SK = postId#addedDateTime
Great for querying all the posts for a given topic, but they all end up in the same partition ("hot partition").
PK = topic#date and SK = postId#addedDateTime
Store items in buckets, e.g. a new bucket for each day. This would push a lot of logic to the application layer and add latency. E.g. if you need to get 10 posts, you'd have to query today's bucket and, if that bucket contains fewer than 10 items, query yesterday's bucket, and so on. Don't even get me started on pagination. That would probably be a nightmare if it crosses buckets.
So my question is: how do you store and query an unbounded list of items the "DynamoDB way"?
I think you've got a good understanding about your options.
I can't profess to know the One True Way™ to solve this particular problem in DynamoDB, but I'll throw out a few thoughts for the sake of discussion.
While it's easy to get a single post, you can't effectively query a list of posts without a scan.
This would definitely be the case if your Primary Key consists solely of the postId (I'll use POST#<postId> to make it easier to read), i.e. a table where each Post is stored under PK = POST#<postId> along with its attributes.
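A sketch of the "fetch by ID" access against that table, using the DocumentClient from the AWS SDK for JavaScript (the table name Posts is my assumption):

// Sketch only: single-item lookup by primary key.
const { DynamoDBClient } = require('@aws-sdk/client-dynamodb');
const { DynamoDBDocumentClient, GetCommand } = require('@aws-sdk/lib-dynamodb');

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

async function getPost(postId) {
  const { Item } = await ddb.send(new GetCommand({
    TableName: 'Posts',
    Key: { PK: `POST#${postId}` },
  }));
  return Item;
}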
This would be super efficient for the "fetch post details view" (aka "fetch post by ID") access pattern. However, we haven't built in any way to access a group of Posts by topic. Let's give that a shot next.
There are a few ways to model the one-to-many relationship between Posts and topics. The first thing that comes to mind is creating a secondary index on the topic field, so that each Post also carries its topic as the partition key of that index.
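A sketch of querying that index (the index name GSI1 is my assumption), reusing the DocumentClient from the earlier sketch:

// Sketch only: query the secondary index by topic.
const { QueryCommand } = require('@aws-sdk/lib-dynamodb');

async function getPostsByTopic(topic) {
  const { Items } = await ddb.send(new QueryCommand({
    TableName: 'Posts',
    IndexName: 'GSI1',
    KeyConditionExpression: 'topic = :topic',
    ExpressionAttributeValues: { ':topic': topic },
  }));
  return Items || [];
}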
Now we can get an item collection of Posts by topic using the efficient query operation. Pagination will help you if your number of Posts per topic grows larger. This may be enough for your application. For the sake of this discussion, let's assume it creates a hot partition and consider what strategies we can introduce to reduce the problem.
One Option
You said
Store items in buckets, e.g. a new bucket for each day.
This is a great idea! Let's update our secondary index partition key to be <topic>#<truncated_timestamp> so we can group posts by topic for a given time frame (day/week/month/etc).
I've done a few things here:
Introduced two new attributes to represent the secondary index PK and SK (GSIPK and GSISK respectively).
Introduced a truncated timestamp into the partition key to represent a given month. For example, if POST#1 and POST#2 both have a posted_at timestamp in September, I'd truncate both of those timestamps to 2020-09-01 to represent the entire month of September (or whatever time boundary makes sense for your application).
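To make that concrete, a sketch of the read side under this layout (the index name, the GSIPK attribute, and the TOPIC#<topic>#<YYYY-MM-01> key format are all my assumptions) could walk back one month-bucket at a time until it has enough posts:

// Sketch only: query the current month's bucket first, then older ones as needed.
const { DynamoDBClient } = require('@aws-sdk/client-dynamodb');
const { DynamoDBDocumentClient, QueryCommand } = require('@aws-sdk/lib-dynamodb');

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

async function latestPostsForTopic(topic, limit, monthsToLookBack = 6) {
  const posts = [];
  const cursor = new Date();
  for (let i = 0; i < monthsToLookBack && posts.length < limit; i++) {
    const month = `${cursor.getUTCFullYear()}-${String(cursor.getUTCMonth() + 1).padStart(2, '0')}-01`;
    const { Items } = await ddb.send(new QueryCommand({
      TableName: 'Posts',
      IndexName: 'GSI1',
      KeyConditionExpression: 'GSIPK = :pk',
      ExpressionAttributeValues: { ':pk': `TOPIC#${topic}#${month}` },
      ScanIndexForward: false, // newest first within the month bucket
      Limit: limit - posts.length,
    }));
    posts.push(...(Items || []));
    cursor.setUTCMonth(cursor.getUTCMonth() - 1); // step back one bucket
  }
  return posts;
}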
This will help distribute your data across partitions, reducing the hot key issue. As you correctly note, this will increase the complexity of your application logic and increase latency, since you may need to make multiple requests to retrieve enough results for your application's needs. However, this might be a reasonable trade-off in this situation. If the increased latency is a problem, you could pre-populate a partition to contain the results of the prior N months' worth of a topic's discussion (e.g. PK = TOPIC_CACHE#<topic> with a list attribute that contains the postIds from the prior N months).
If the TOPIC_CACHE ends up being a hot partition, you could always shard the partition using a calculated suffix (e.g. TOPIC_CACHE#<topic>#1 through TOPIC_CACHE#<topic>#N):
Your application could randomly select a TOPIC_CACHE between 1..N when retrieving the topic cache.
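For example, a hypothetical helper to pick the shard:

// Sketch only: the shard count and key format are assumptions.
const NUM_SHARDS = 4;

function topicCacheKey(topic) {
  const shard = 1 + Math.floor(Math.random() * NUM_SHARDS); // random shard 1..N
  return `TOPIC_CACHE#${topic}#${shard}`;
}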
There are numerous ways to approach this access pattern, and these options represent only a few possibilities. If it were my application, I would start by creating a secondary index using the Post topic as the partition key. It's the easiest to implement and would give me an opportunity to see how my application access patterns performed in a production environment. If the hot key issue started to become a problem, I'd dive deeper into some sort of caching solution.
I have a question about how to structure my Firestore data.
Daily, I have information like:
TodayDate, From, ToUser1, Subject, Attachment2, AttachmentTypeB
TodayDate, From, ToUser1, Subject, Attachment3, AttachmentTypeA
TodayDate, From, ToUser2, Subject, Attachment4, AttachmentTypeA
TodayDate, From, ToUser2, Subject, Attachment5, AttachmentTypeC
Subject and From are never the same.
I am hesitating between two structures, but I am open to considering other structure designs.
0/ Root / doc / sub col / sub col fields
1/ users / userid / date / from,subject,etc
OR
2/ reports / date / userid / from,subject,etc
I believe solution 2 will be more cost-effective in the long run, since for one query I will have more records per date than records per user. For updates, it is similar.
What are your advice, please?
Kind regards,
Julie
Given your current data structure, I suggest you simply use Cloud Firestore instead of the Realtime Database, as it scales better and you get quite good performance at very low cost.
You could start a collection, with each of the records containing your listed attributes: TodayDate, From, ToUser2, Subject, Attachment5, AttachmentTypeC. And it's easy to query using where:
firestore().collection("myCollection").where("subject", "==", subject).get()
See this comparison.
UPDATE: Regarding your two options, I don't think it comes down to which option fetches/updates more or fewer records. It comes down to your app's requirements and actual usage. You might need to fetch the records for a specific user and not just for a specific date, and vice versa. So the two structures don't really make any difference in terms of cost, unless you're sure you'll never need to fetch records per user.
Hence, I think the main focus should be on how intuitive and flexible your structure is and how easy it is to maintain over time. You should consider not using sub-collections in the first place, as it appears (from your daily record data) that you could achieve what you need, and get a more flexible structure, with a simple collection containing documents with the necessary properties. I think sub-collections are generally needed when you don't want to always fetch all properties of a record, or when you want real-time listeners for specific properties rather than the entire record. Sub-collections don't really increase or reduce the number of records fetched; that depends on actual usage.
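For example, a sketch of such a flat collection and the two kinds of queries (the collection and field names are just illustrative):

// Sketch only: one flat document per daily record, queryable per user or per date.
async function example(db) {
  await db.collection('reports').add({
    date: '2021-03-15',
    from: 'sender@example.com',
    toUser: 'userId1',
    subject: 'Subject line',
    attachment: 'attachment2.pdf',
    attachmentType: 'B',
  });

  // Either dimension works without restructuring the data:
  const byUser = await db.collection('reports').where('toUser', '==', 'userId1').get();
  const byDate = await db.collection('reports').where('date', '==', '2021-03-15').get();
  return { byUser, byDate };
}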
Let's say I have a collection of cars and I want to filter them by price range and by year range. I know that Firestore has strict limitations due performance reasons, so something like:
db.collection("products")
.where('price','>=', 70000)
.where('price','<=', 90000)
.where('year','>=', 2015)
.where('year','<=', 2018)
will throw an error:
Invalid query. All where filters with an inequality (<, <=, >, or >=) must be on the same field.
So is there any other way to perform this kind of query without local data managing? Maybe some kind of indexing or tricky data organization?
The error message and documentation are quite explicit on this: a Firestore query can only perform range filtering on a single field. Since you're trying to filter ranges on both price and year, that is not possible in a single Firestore query.
There are two common ways around this:
Perform filtering on one field in the query, and on the other field in your client-side code.
Combine the values of the two ranges into a single field in some way that allows your use-case with a single field. This is incredibly non-trivial, and the only successful example of such a combination that I know of is using geohashes to filter on latitude and longitude.
Given the difference in effort between these two, I'd recommend picking the first option.
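A sketch of that first option (range-filter the price in the query, filter the year range in client code):

// Sketch only: both inequalities on 'price' are allowed since they target the same field.
async function carsInRange(db) {
  const snapshot = await db.collection('products')
    .where('price', '>=', 70000)
    .where('price', '<=', 90000)
    .get();

  // Apply the year range locally.
  return snapshot.docs
    .map((doc) => doc.data())
    .filter((car) => car.year >= 2015 && car.year <= 2018);
}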
A third option is to model your data differently, as to make it easier to implement your use-case. The most direct implementation of this would be to put all products from 2015-2018 into a single collection. Then you could query that collection with db.collection("products-2015-2018").where('price','>=', 70000).where('price','<=', 90000).
A more general alternative would be to store the products in a collection for each year, and then perform 4 queries to get the results you're looking for: one of each collection products-2015, products-2016, products-2017, and products-2018.
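A sketch of that per-year-collection approach (the collection names are assumptions):

// Sketch only: one query per year collection, merged client-side.
async function carsByYearCollections(db) {
  const years = [2015, 2016, 2017, 2018];
  const snapshots = await Promise.all(years.map((year) =>
    db.collection(`products-${year}`)
      .where('price', '>=', 70000)
      .where('price', '<=', 90000)
      .get()
  ));
  return snapshots.flatMap((snap) => snap.docs.map((doc) => doc.data()));
}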
I recommend reading the document on compound queries and their limitations, and watching the video on Cloud Firestore queries.
You can't do multiple range queries, as there are limitations mentioned here, but with a little cost to the UI you can still achieve it by indexing the year, like this:
db.collection("products")
.where('price','>=', 70000)
.where('price','<=', 90000)
.where('yearCategory', 'in', ['new', 'old'])
Of course, "new" and "old" go out of date, so you can group the years into a yearCategory like yr-2014-2017, yr-2017-2020, and so on. The in operator can only take 10 values per query, so this may give you an idea of how wide a range to index the years with.
You can write yearCategory during insert, or, if you have a value with a large range such as a number of likes, you'd want another process that polls this data and updates the category.
In Flutter, you can do something like this:
final _queryList = await db.collection("products").where('price', isGreaterThanOrEqualTo: 70000).get();
final _docList = _queryList.docs.where((doc) => doc['price'] <= 90000);
Add more query conditions as you want, but Firestore only allows a limited combination of filters per query, so fetch what you can server-side and then filter the data locally according to your needs.
I'm developing a statistics module for my website that will help me measure conversion rates, and other interesting data.
The mechanism I use is to store an entry in a statistics table in my DB each time a user enters a specific zone (I avoid duplicate records with the help of cookies).
For example, I have the following zones:
Website - a general zone used to count unique users, since I've stopped trusting Google Analytics lately.
Category - self descriptive.
Minisite - self descriptive.
Product Image - whenever a user sees a product image and the lead submission form.
The problem is that after a month my statistics table is packed with a lot of rows, and the ASP.NET pages I wrote to parse the data load really slowly.
I thought about writing a service that would somehow pre-parse the data, but I can't see any way to do that without losing flexibility.
My questions:
How do large-scale data-parsing applications like Google Analytics load the data so fast?
What is the best way for me to do it?
Maybe my DB design is wrong and I should store the data in only one table?
Thanks for anyone that helps,
Eytan.
The basic approach you're looking for is called aggregation.
You are interested in certain functions calculated over your data, and instead of calculating them "online" when rendering the page that displays them, you calculate them offline, either via a nightly batch process or incrementally whenever a log record is written.
A simple enhancement would be to store counts per user/session instead of storing every hit and counting them. That would reduce your analytic processing requirements by a factor on the order of the number of hits per session. Of course, it would increase processing costs when inserting log entries.
Another kind of aggregation is called online analytical processing (OLAP), which only pre-aggregates along some dimensions of your data and lets users aggregate the other dimensions interactively. This trades off performance, storage and flexibility.
It seems like you could do well by using two databases. One is for transactional data and it handles all of the INSERT statements. The other is for reporting and handles all of your query requests.
You can index the snot out of the reporting database, and/or denormalize the data so fewer joins are used in the queries. Periodically export data from the transaction database to the reporting database. This will improve the reporting response time, along with the aggregation ideas mentioned earlier.
Another trick to know is partitioning. Look up how that's done in the database of your choice - but basically the idea is that you tell your database to keep a table partitioned into several subtables, each with an identical definition, based on some value.
In your case, what is very useful is "range partitioning" -- choosing the partition based on the range a value falls into. If you partition by date range, you can create separate sub-tables for each week (or each day, or each month -- it depends on how you use your data and how much of it there is).
This means that if you specify a date range when you issue a query, the data outside that range will not even be considered; that can lead to very significant time savings, even better than an index (an index still grows with your data, while partition pruning simply skips the sub-tables outside the range).
This makes both the online queries (the ones issued when you hit your ASP.NET page) and the aggregation queries you use to pre-calculate the necessary statistics much faster.