I am making a money management app where the user can create transfers for each day.
I am currently listing all the data on the main screen. At the moment that doesn't matter because there isn't much data, but imagine a user who has used the app for several years and tracked all of their spending.
My first thought was to cache all the available data for that user, but that would cause too many unnecessary reads because the user most likely won't need the data from, let's say, 5 years ago.
So I thought the solution would be to just implement pagination for that screen.
But:
The user can get statistics about his spending history on another screen by selecting a category and a time period.
Currently I am running a query on those parameters each time they change, but this will obviously also lead to a lot of unnecessary reads.
So the problem is: if the user chooses to get statistics from 5 years ago, that data wouldn't exist in the cache, so I would still have to run a query for that time period and then end up with an incomplete cache of that period, because I only got some of the data from the query.
Would love to hear your thoughts on this. How would you handle it?
In general: don't run aggregation queries from the client on demand. Instead store aggregated data in the database, and update it as data is written.
So say that you keep some annual totals, such as their balance at the start and end of the year, their total income and spend for that year, probably broken down by categories. That is all information that you could put in a document for each year.
You'd have a structure like /users/$uid/totals/$year, with the totals as fields in that document. Every time you write a new transaction, you update the totals document for that user for the current year.
If you do this, you'll only need to read the totals document to show totals, and you'll only need to read individual transactions if you want to show individual transactions.
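For illustration, here is a minimal sketch of that write pattern with the Firestore Web SDK; the transactions subcollection and the field names (amount, category, total_spend, the per-category spend_* fields) are assumptions for this example, not a prescribed schema:

```typescript
import { getFirestore, doc, collection, writeBatch, increment, serverTimestamp } from "firebase/firestore";

// Sketch: write the transaction and bump the per-year totals atomically.
// Assumes initializeApp() has already been called elsewhere.
async function addTransaction(uid: string, amount: number, category: string) {
  const db = getFirestore();
  const year = String(new Date().getFullYear());

  const txRef = doc(collection(db, "users", uid, "transactions")); // new doc, auto-ID
  const totalsRef = doc(db, "users", uid, "totals", year);

  const batch = writeBatch(db);
  batch.set(txRef, { amount, category, createdAt: serverTimestamp() });
  batch.set(
    totalsRef,
    {
      total_spend: increment(amount),           // hypothetical field name
      [`spend_${category}`]: increment(amount), // hypothetical per-category field
    },
    { merge: true } // creates the year document on its first write
  );
  await batch.commit();
}
```

Because both writes go through one batch, the totals document stays consistent with the transactions that produced it.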
Also see: Is it possible to run aggregation queries in Firestore?
Related
Let's say I have a multi-restaurant food order app.
I'm storing orders in Firestore as documents.
Each order object/document contains:
total: double
deliveredByUid: str
restaurantId: str
I want to see, at any time during the day, the totals of every driver for each restaurant, like so:
robert:
  mcdonalds: 10
  kfc: 20
alex:
  mcdonalds: 35
  kfc: 10
What is the best way of calculating the totals of all the orders?
I am currently thinking of the following:
The safest and easiest method, but expensive: each time I need to know the totals, I just query all the documents from that day and calculate them one by one.
Cloud Functions method: each time an order is added/removed, modify a value at a specific Realtime Database child: /totals/driverId/placeId.
Manual totals: each time a driver completes an order and writes their ID to the order object, make another write to that specific Realtime Database child.
Edit: added the whole order object because I was asked to.
What I would most likely do is make sure orders are completely atomic (or as atomic as they can be). Most likely, I'd perform the order on the client within a transaction or batch write (both are atomic) that would not only create this document in question but also update the delivery driver's document by incrementing their running total. Depending on how extensible I wanted to get, I may even create subcollections within the user's document that represented chunks of time if I wanted to be able to record totals by month or year, or whatever. You really want to think this one through now.
The reason I'd advise against your suggested pattern is because it's not atomic. If the operation succeeds on the client, there is no guarantee it will succeed in the cloud. If you make both writes part of the same transaction then they could never be out of sync and you could guarantee that the total will always be accurate.
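For illustration, a sketch of that pattern with the Firestore Web SDK; the drivers/{uid}/dailyTotals/{day} document and its field layout are assumptions made up for this example:

```typescript
import { getFirestore, doc, collection, writeBatch, increment } from "firebase/firestore";

// Sketch: create the order and bump the driver's running total for that restaurant
// in one atomic batch. Assumes initializeApp() was called elsewhere.
async function completeOrder(deliveredByUid: string, restaurantId: string, total: number) {
  const db = getFirestore();
  const day = new Date().toISOString().slice(0, 10); // e.g. "2024-05-01"

  const orderRef = doc(collection(db, "orders")); // new order, auto-ID
  const totalsRef = doc(db, "drivers", deliveredByUid, "dailyTotals", day); // hypothetical layout

  const batch = writeBatch(db); // both writes commit together or not at all
  batch.set(orderRef, { total, deliveredByUid, restaurantId });
  batch.set(totalsRef, { [restaurantId]: increment(total) }, { merge: true });
  await batch.commit();
}
```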
I have a problem regarding the organization of my data.
What I want to achieve
TL;DR: One data point updated in real time in many different groups; how to organize?
Each user sets a daily goal (goal) he wants to achieve
While working, each user increases his time spent (daily_time_spent) to get closer to his daily goal (say, from 1 minute spent to 2 minutes spent).
Each user can also be in a group with other users.
If there is a group of users, you can see each other's progress (goal/daily_time_spent) in real time (real time being every 2-5 minutes, for cost reasons).
It will later also be possible to set a daily goal for a specific group. Your own daily goal would contribute to each of the groups.
Say you are part of three groups with the goals 10m/20m/30m and you have already done 10m: then you would have completed the first group, done 50% of the second group, and about 33% of the third group. Your own progress (daily_time_spent) contributes to all groups, regardless of the individual goals (group_daily_goal).
My ideas
How would I organize that? One idea: when a user increments his/her time, the new value gets written into each group the user is part of. But this seems pretty inefficient, because I would potentially write the same data in many different places (coming from the background of a SQL developer, it might also be expensive?).
Another option: each user tracks his time, say under userTimes/{user}, and then there are the groups, groups/{groupname}, with links to userTimes. But then I don't know how to get realtime updates.
Thanks a lot for your help!
Both approaches can work fine, and there is no singular best approach here - as Alex said, it all depends on the use cases of your app and your comfort level with the code that is required for each of them.
Duplicating the data under each relevant user will complicate the code that performs the write operation, and store more data. But in return for that, reading the data will be really simple and scale very well to many users.
Reading the data from under all followed users will complicate the code that performs the read operation, and slow it down a bit (though not nearly as much as you may expect, as Firebase can pipeline the requests). But it does keep your data minimal and your write operations simple.
If you choose to duplicate the data, that is an operation that you can usually do well in a (RTDB-triggered) Cloud Function, but it's also possible to do it through a multi-path write operation from the client.
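For the multi-path write option, a minimal sketch with the Realtime Database Web SDK; the /userTimes and /groups/{groupId}/members paths are assumptions chosen for illustration:

```typescript
import { getDatabase, ref, update } from "firebase/database";

// Sketch of the fan-out/duplication approach as a single multi-path update.
// Assumes initializeApp() was called elsewhere; the paths below are hypothetical.
async function recordTimeSpent(uid: string, groupIds: string[], newDailyTotal: number) {
  const db = getDatabase();

  const updates: Record<string, number> = {
    [`userTimes/${uid}/daily_time_spent`]: newDailyTotal,
  };
  for (const groupId of groupIds) {
    updates[`groups/${groupId}/members/${uid}/daily_time_spent`] = newDailyTotal;
  }

  // All paths in one update() call are written atomically in a single request.
  await update(ref(db), updates);
}
```

Group members can then attach regular listeners to groups/{groupId}/members to get the real-time view described above.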
I am creating an application that uses Cloud Firestore to store data about "events" in our lab on several assets. We collected data for a few months and we are averaging about 2000 events per asset per month. Each event captures a few pieces of metadata that the user can query.
I imported all the data into firestore with a very simple layout at first.
Events (Collection of event data)
-> EventData (documents which contain a few fields of metadata)
From my understanding, even if the collection of events becomes quite large, for billing and speed of queries this won't be a problem (assuming I do some sort of pagination on the query results). The composite indexes are also very manageable with this structure.
The problem I see is that if someone goes and looks at the Firestore console and brings that collection up, our read requests go through the roof. It seems that this does a full read on the entire collection... which of course will kill us on billing as time goes on. I don't see this as a problem forever, as eventually we should get to the point where everything is stable and we won't need to go into the console very often, but what if someone does when we have a million or more records?
My next thought was to structure the database like this:
Events -> Assets -> {Asset_Name} -> {year_month} -> {Collection of documents with field metadata}
This certainly solves the issue of the ever-growing collection of documents. The number of assets that we have is fixed, and the number of events is (effectively) capped to a maximum amount per month as well. The problem with this setup, however, is managing composite indexes. There are about 5 indexes needed for my original setup. I think this alternative setup means I would need to set up the same 5 indexes for each collection of documents for every asset every month.
I thought maybe there could be a way to have a cloud function manage it for me (it doesn't appear there is an API for this). I think the number of indexes per project is also capped.
So, in the end, I am looking for recommendations on how to structure this database to limit reads if using the console, as well as keeping the indexes manageable. I am pretty new to NoSQL and perhaps I am just completely off.
I recommend you keep your structure as is if that's what's working for you. You should not need to optimize for reducing console reads. Console reads do count towards your usage but the console does not load the entire collection when you open the console.
The console loads just enough documents to let you scroll a bit and then it loads more documents if you scroll down. It will only load the entire collection if you scroll through the entire collection.
I'm building an application in ASP.NET (VB) with a MS SQL database. It is a search tool for cars that has a list of every car and all of their attributes (colors, # of doors, gas mileage, mfg. year, etc.). This tool outputs the results in a GridView, and the user has the ability to perform advanced searches and filtering. The filtering needs to be very fine-grained (range of gas mileage, color(s), mfg. year range, etc.), and I cannot seem to find the best way to do this filtering without a large SQL WHERE statement that is going to greatly impact SQL performance and page load. I feel like I'm missing something very obvious here; thank you for any help. I'm not sure what other details would be helpful.
This is not an OLTP database you're building--it's really an analytics database. There really isn't a way around the problem of having to filter. The question is whether the organization of the data will allow seeks most of the time or will require scans, and also whether the resulting JOINs can be done efficiently or not.
My recommendation is to go ahead and create the data normalized and all, as you are doing. Then, build a process that spins it into a data warehouse, denormalizing like crazy as needed, so that you can do filtering by WHERE clauses that have to do a lot less work.
For every single possible search result, you have a row in a table that doesn't require joining to other tables (or only a few fact tables).
You can reduce complexity a bit for some values, such as gas mileage, by striping the mileage into bands of, say, 5 mpg (10-14, 15-19, 20-24, 25-29, etc.).
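As a small sketch of that banding idea (the band width and labels are assumptions, and TypeScript is used here only to illustrate the mapping, regardless of what the actual load job is written in):

```typescript
// Sketch: map raw gas mileage onto 5 mpg bands so the warehouse can filter on a
// small set of discrete values instead of an open numeric range.
function mileageBand(mpg: number): string {
  const lower = Math.floor(mpg / 5) * 5; // 23 -> 20, 27 -> 25, ...
  return `${lower}-${lower + 4}`;        // e.g. "20-24", "25-29"
}

// Example: tag each car row with its band during the warehouse load.
const cars = [{ id: 1, mpg: 23 }, { id: 2, mpg: 31 }];
const warehouseRows = cars.map(c => ({ ...c, mpgBand: mileageBand(c.mpg) }));
console.log(warehouseRows); // [{ id: 1, mpg: 23, mpgBand: "20-24" }, { id: 2, mpg: 31, mpgBand: "30-34" }]
```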
As you need to add to the data and change it, your data-warehouse-loading process (that runs once a day perhaps) will keep the data warehouse up to date. If you want more frequent loading that doesn't keep clients offline, you can build the data warehouse to an alternate node, then swap them out. Let's say it takes 2 hours to build. You build for 2 hours to a new database, then swap to the new database, and all your data is only 2 hours old. Then you wipe out the old database and use the space to do it again.
I'm developing a statistics module for my website that will help me measure conversion rates, and other interesting data.
The mechanism I use is to store a database entry in a statistics table each time a user enters a specific zone in my DB (I avoid duplicate records with the help of cookies).
For example, I have the following zones:
Website - a general zone used to count unique users as I stopped trusting Google Analytics lately.
Category - self descriptive.
Minisite - self descriptive.
Product Image - whenever a user sees a product and the lead submission form.
The problem is that after a month, my statistics table is packed with a lot of rows, and the ASP.NET pages I wrote to parse the data load really slowly.
I thought about writing a service that will somehow parse the data, but I can't see any way to do that without losing flexibility.
My questions:
How do large-scale data-parsing applications like Google Analytics load the data so fast?
What is the best way for me to do it?
Maybe my DB design is wrong and I should store the data in only one table?
Thanks to anyone that helps,
Eytan.
The basic approach you're looking for is called aggregation.
You are interested in certain functions calculated over your data, and instead of calculating them "online" when the displaying website starts up, you calculate them offline, either via a nightly batch process or incrementally when the log record is written.
A simple enhancement would be to store counts per user/session instead of storing every hit and counting them. That would reduce your analytics processing requirements by a factor on the order of the number of hits per session. Of course, it would increase processing costs when inserting log entries.
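As a sketch of that incremental, per-session variant (the mssql Node package, the SessionZoneCounts table, and its columns are all assumptions for illustration; the same upsert could equally be issued from the ASP.NET side):

```typescript
import sql from "mssql";

// Placeholder connection config for the stats database (all names are assumptions).
const poolPromise = sql.connect({
  server: "localhost",
  database: "SiteStats",
  user: "stats_user",
  password: "change-me",
  options: { trustServerCertificate: true },
});

// Sketch: instead of inserting one row per hit, upsert a counter per (session, zone).
// dbo.SessionZoneCounts(SessionId, Zone, Hits) is a hypothetical table.
export async function recordHit(sessionId: string, zone: string): Promise<void> {
  const pool = await poolPromise;
  await pool
    .request()
    .input("sessionId", sql.VarChar(64), sessionId)
    .input("zone", sql.VarChar(32), zone)
    .query(`
      MERGE dbo.SessionZoneCounts AS t
      USING (SELECT @sessionId AS SessionId, @zone AS Zone) AS s
        ON t.SessionId = s.SessionId AND t.Zone = s.Zone
      WHEN MATCHED THEN UPDATE SET Hits = t.Hits + 1
      WHEN NOT MATCHED THEN INSERT (SessionId, Zone, Hits)
        VALUES (s.SessionId, s.Zone, 1);
    `);
}
```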
Another kind of aggregation is called online analytical processing, which only aggregates along some dimensions of your data and lets users aggregate the other dimensions in a browsing mode. This trades off performance, storage and flexibility.
It seems like you could do well by using two databases. One is for transactional data and it handles all of the INSERT statements. The other is for reporting and handles all of your query requests.
You can index the snot out of the reporting database, and/or denormalize the data so fewer joins are used in the queries. Periodically export data from the transaction database to the reporting database. This act will improve the reporting response time along with the aggregation ideas mentioned earlier.
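A rough sketch of such a periodic export, again using the mssql Node package purely for illustration; the ZoneHits source table, the DailyZoneSummary reporting table, and the database names are assumptions:

```typescript
import sql from "mssql";

// Sketch of a nightly job: aggregate yesterday's raw hits into a denormalized
// summary table in the reporting database. All object names are hypothetical.
async function exportDailySummary(): Promise<void> {
  const pool = await sql.connect({
    server: "db-server",
    database: "StatsTransactional",
    user: "etl_user",          // placeholder credentials
    password: "change-me",
    options: { trustServerCertificate: true },
  });

  await pool.request().query(`
    INSERT INTO StatsReporting.dbo.DailyZoneSummary (SummaryDate, Zone, UniqueUsers, Hits)
    SELECT CAST(CreatedAt AS date), Zone, COUNT(DISTINCT UserId), COUNT(*)
    FROM dbo.ZoneHits
    WHERE CreatedAt >= CAST(DATEADD(day, -1, GETDATE()) AS date)
      AND CreatedAt <  CAST(GETDATE() AS date)
    GROUP BY CAST(CreatedAt AS date), Zone;
  `);

  await pool.close();
}
```

The reporting pages then read only the pre-aggregated summary rows, which stays fast no matter how large the raw hit table grows.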
Another trick to know is partitioning. Look up how that's done in the database of your choice - but basically the idea is that you tell your database to keep a table partitioned into several subtables, each with an identical definition, based on some value.
In your case, what is very useful is "range partitioning" -- choosing the partition based on the range into which a value falls. If you partition by date range, you can create separate sub-tables for each week (or each day, or each month -- depends on how you use your data and how much of it there is).
This means that if you specify a date range when you issue a query, the data that is outside that range will not even be considered; that can lead to very significant time savings, even better than an index (an index has to consider every row, so it will grow with your data; a partition is one per day).
This makes both online queries (ones issued when you hit your ASP page), and the aggregation queries you use to pre-calculate necessary statistics, much faster.