We have migrated all of our old Firebase BigQuery events tables to the new schema using the provided script. One thing we noticed was that the size of the daily tables increased dramatically.
For example, the data for 4/1/18 in the old schema was 3.5MM rows and 8.7 GB. Once migrated, the new table for the same date is 32.3MM rows and 27 GB: nearly 10 times as many rows and over 3x the storage.
Can someone tell me why the same data is so much larger in the new schema?
The result is that we are getting charged significantly more in BigQuery query costs when reading the tables from the new schema versus the old schema.
firebaser here
While increasing the size of the exported data definitely wasn't a goal, it is an expected side-effect of the new schema.
In the old storage format the events were stored in bundles. While I don't know exactly how the events were bundled, each bundle was always a set of events, each with its own unique properties plus some properties shared across the bundle. This meant that you frequently had to unnest the data in your query, or cross join the table with itself, to get to the raw events, and then combine and group them again to fit your requirements.
In the new storage format, each event is stored separately. This definitely increases the storage size, since properties that were shared between events in a bundle are now duplicated for each event. But the queries you write against the new format should be easier to read and can process the data faster, since they don't have to unnest it first.
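To make that concrete, here is a rough sketch (mine, not the original poster's; the project, dataset, and table names are placeholders) of the same event count written against both schemas, using the Node.js BigQuery client:
// Sketch only: placeholder project/dataset/table names.
const { BigQuery } = require('@google-cloud/bigquery');
const bigquery = new BigQuery();

// Old schema: events are bundled per row, so you UNNEST event_dim to reach them.
const oldSchemaQuery = `
  SELECT event.name AS event_name, COUNT(*) AS event_count
  FROM \`my-project.my_old_dataset.app_events_20180401\`,
       UNNEST(event_dim) AS event
  GROUP BY event_name`;

// New schema: one row per event, so the same question needs no UNNEST.
const newSchemaQuery = `
  SELECT event_name, COUNT(*) AS event_count
  FROM \`my-project.analytics_123456789.events_20180401\`
  GROUP BY event_name`;

async function compareSchemas() {
  const [oldRows] = await bigquery.query({ query: oldSchemaQuery });
  const [newRows] = await bigquery.query({ query: newSchemaQuery });
  console.log('old schema:', oldRows);
  console.log('new schema:', newRows);
}

compareSchemas().catch(console.error);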
So the larger storage size should come with a somewhat faster processing speed. But I can totally imagine the sticker shock when you see the difference and realize the improved speed doesn't always make up for it. I apologize if that is the case, and I have been assured that we don't have any other big schema changes planned from here on.
Related
Is there any good documentation on how query times change for a DynamoDB table with equal read capacity but differing row sizes? I've been reading through the documentation and can't find anything; I was wondering if anybody has done any studies on this.
My use case is that I'm putting a million rows into a table each week. These records are referenced quite a bit as they're entered, but as time goes on the frequency at which I query those rows decreases. Can I leave those records in the table indefinitely with no detrimental effect on query time, or should I rotate them out so the newer data that is requested more frequently returns faster?
Please don't keep the old data indefinitely; it is advisable to archive older data for better performance.
A few points on design and testing:
Design the partition (hash) key so that data is distributed evenly across the partitions.
Understand the access patterns for time series data.
Test your application at scale to avoid problems with "hot" keys when your table becomes larger.
Suppose you design a table to track customer behavior on your site,
such as URLs that they click. You might design the table with a
composite primary key consisting of Customer ID as the partition key
and date/time as the sort key. In this application, customer data
grows indefinitely over time; however, the applications might show
uneven access pattern across all the items in the table where the
latest customer data is more relevant and your application might
access the latest items more frequently and as time passes these items
are less accessed, eventually the older items are rarely accessed. If
this is a known access pattern, you could take it into consideration
when designing your table schema. Instead of storing all items in a
single table, you could use multiple tables to store these items. For
example, you could create tables to store monthly or weekly data. For
the table storing data from the latest month or week, where data
access rate is high, request higher throughput and for tables storing
older data, you could dial down the throughput and save on resources.
Time Series Data Access Pattern
Guidelines for table partitions
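As a rough sketch of that guidance (my own illustration, not from the docs; the table naming scheme and item attributes are made up), writes can be routed to a per-week table and throughput dialed down on older tables with the AWS SDK for JavaScript:
// Illustrative only: per-week table names and item attributes are assumptions.
const AWS = require('aws-sdk');
const dynamodb = new AWS.DynamoDB();
const docClient = new AWS.DynamoDB.DocumentClient();

// Approximate week number, good enough to pick a table name like "events_2018_W14".
function tableForDate(date) {
  const year = date.getUTCFullYear();
  const week = Math.ceil(((date - Date.UTC(year, 0, 1)) / 86400000 + 1) / 7);
  return `events_${year}_W${String(week).padStart(2, '0')}`;
}

// "Hot" writes go to the current week's table.
function recordClick(customerId, url) {
  return docClient.put({
    TableName: tableForDate(new Date()),
    Item: { customerId, clickedAt: new Date().toISOString(), url },
  }).promise();
}

// Once a table holds only "cold" data, dial its provisioned throughput down.
function dialDownThroughput(tableName) {
  return dynamodb.updateTable({
    TableName: tableName,
    ProvisionedThroughput: { ReadCapacityUnits: 1, WriteCapacityUnits: 1 },
  }).promise();
}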
I have a messaging app, where all messages are arranged into seasons by creation time. There could be billions of messages each season. I have a task to delete messages of old seasons. I thought of a solution, which involves DynamoDB table creation/deletion like this:
Each table contains the messages of only one season.
When a season becomes 'old' and its messages are no longer needed, the table is deleted.
Is this a good pattern, and is it encouraged by Amazon?
P.S. I'm asking because I'm wary of two things I've run into with other Amazon services:
In Amazon S3 you have to delete every object before you can delete the bucket itself. When you have billions of objects, that becomes a real pain.
In Amazon SQS there is a notion of 'unwanted behaviour': when using the SQS API you can treat the infrastructure badly (for example, by not polling for messages) and be penalized for it.
Yes, this is an acceptable design pattern; it actually follows a best practice put forward by the AWS team. There are a few things to consider for your specific use case, though.
AWS has a limit of 256 tables per region, but this can be raised. If you are expecting to need multiple orders of magnitude more than that, you should probably re-evaluate.
You can delete a DynamoDB table that still contains records. If you have a large number of records that you have to delete regularly, using a rolling set of tables like this is actually a best practice.
Creating and deleting tables are asynchronous operations, so you do not want your application to depend on how long they take to complete. Make sure you create tables well in advance of needing them. Under normal circumstances tables are created in a few seconds to a few minutes, but under very, very rare outage circumstances I've seen it take hours.
The DynamoDB best practices documentation on Understand Access Patterns for Time Series Data states...
You can save on resources by storing "hot" items in one table with
higher throughput settings, and "cold" items in another table with
lower throughput settings. You can remove old items by simply deleting
the tables. You can optionally backup these tables to other storage
options such as Amazon Simple Storage Service (Amazon S3). Deleting an
entire table is significantly more efficient than removing items
one-by-one, which essentially doubles the write throughput as you do
as many delete operations as put operations.
It's perfectly acceptable to split your data the way you describe. You can delete a DynamoDB table regardless of its size or how many items it contains.
As far as I know there are no explicit SLAs for the time it takes to delete or create tables (meaning there is no way to know whether it will take 2 seconds, 2 minutes, or 20 minutes), but as long as your solution does not depend on this sort of timing, you're fine.
In fact, the idea of sharding your data based on age has the potential to significantly improve the performance of your application, and it will definitely help you control your costs.
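A minimal sketch of that rolling-table approach (the season naming, key schema, and throughput values are assumptions, not a definitive implementation), again with the AWS SDK for JavaScript:
// Illustrative only: season naming and key schema are assumptions.
const AWS = require('aws-sdk');
const dynamodb = new AWS.DynamoDB();

// Create the next season's table well before it is needed; creation is asynchronous.
function createSeasonTable(season) {
  const TableName = `messages_${season}`;
  return dynamodb.createTable({
    TableName,
    AttributeDefinitions: [
      { AttributeName: 'conversationId', AttributeType: 'S' },
      { AttributeName: 'sentAt', AttributeType: 'S' },
    ],
    KeySchema: [
      { AttributeName: 'conversationId', KeyType: 'HASH' },
      { AttributeName: 'sentAt', KeyType: 'RANGE' },
    ],
    ProvisionedThroughput: { ReadCapacityUnits: 50, WriteCapacityUnits: 50 },
  }).promise()
    .then(() => dynamodb.waitFor('tableExists', { TableName }).promise());
}

// Dropping the whole table removes every message in it, with no per-item deletes.
function deleteSeasonTable(season) {
  return dynamodb.deleteTable({ TableName: `messages_${season}` }).promise();
}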
I want to retrieve data from Google Analytics API, create custom calculations and then push the aggregations to a Google Spreadsheets in order to reuse in Google Visualisation API app. My concern is that I'll hit the Spreadsheet cell quota very quickly with the raw data needed for the calculation.
I know the ScriptDb quota is 100MB, but before I invest time and resources in learning how it works, I'd like to get an idea of whether it's feasible for storing raw analytics data (provided it's not too granular and it's designed to answer specific questions) and how much of it I could realistically store in ScriptDb (relative to spreadsheets) before I hit the quota.
Thanks
For bulk data access (e.g. reading a table for Visualization), a spreadsheet will have a speed advantage over ScriptDb (see: What is faster: ScriptDb or SpreadsheetApp?). If you wish to support more sophisticated queries, though, to "answer specific questions" as you mention, then ScriptDb will give you an edge, as query times vary with the number of results but should be unaffected by the query criteria themselves.
With data in a spreadsheet, you will be able to obtain a DataTable for Visualization with a single Range.getDataTable() operation. With ScriptDb, you will need to write a script to build your DataTable.
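For example (a sketch only; the sheet name, record shape, and column labels are assumptions), the two paths to a DataTable look roughly like this in Apps Script:
// Spreadsheet route: the DataTable comes straight from the range.
function dataTableFromSheet() {
  var sheet = SpreadsheetApp.getActiveSpreadsheet().getSheetByName('analytics');
  return sheet.getDataRange().getDataTable(true); // first row as headers
}

// ScriptDb route: query the records and assemble the DataTable yourself.
function dataTableFromScriptDb() {
  var db = ScriptDb.getMyDb();
  var results = db.query({ type: 'pageview' }); // assumed record shape
  var table = Charts.newDataTable()
      .addColumn(Charts.ColumnType.STRING, 'Page')
      .addColumn(Charts.ColumnType.NUMBER, 'Visits');
  while (results.hasNext()) {
    var record = results.next();
    table.addRow([record.page, record.visits]);
  }
  return table.build();
}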
Regarding size constraints, it's not possible to really compare the two without knowing the size of your individual data elements. You're already aware of the general constraints:
Spreadsheet: 40K cells, though you may hit an (unspecified) size limit before that, depending on the size of your data elements.
ScriptDb: 50MB, 100MB, or 200MB depending on account type. The number of objects that can be stored is affected by the complexity (depth) of the objects and the size of their property names, and of course by the size of the data they contain.
Ultimately, the question of which is best for your application is a matter of opinion, and of which factors matter most for the application. If the analytics data is tabular, then a spreadsheet offers advantages for implementation largely because of Range.getDataTable(), and is faster for bulk access. I'd recommend starting there, and considering a move to ScriptDb if and when you actually hit spreadsheet size or query performance limitations.
I'm testing Firebase for a project that may have a reasonably large number of keys, potentially millions.
I've tested loading a few tens of thousands of records using Node, and the load performance appears good. However, the "FORGE" web UI becomes unusably slow and renders every single record if I expand my root node.
Is Firebase not designed for this volume of data, or am I doing something wrong?
It's simply the limitations of the Forge UI. It's still fairly rudimentary.
The real-time functions in Firebase are not only suited for, but designed for, large data sets. The fact that records stream in real time is perfect for this.
Performance is, as with any large data app, only as good as your implementation. So here are a few gotchas to keep in mind with large data sets.
DENORMALIZE, DENORMALIZE, DENORMALIZE
If a data set will be iterated, and its records can be counted in thousands, store it in its own path.
This is bad for iterating large data sets:
/users/uid
/users/uid/profile
/users/uid/chat_messages
/users/uid/groups
/users/uid/audit_record
This is good for iterating large data sets:
/user_profiles/uid
/user_chat_messages/uid
/user_groups/uid
/user_audit_records/uid
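For example (a sketch against the legacy Firebase JavaScript API of that era; the paths and fields are my assumptions), writes fan out to the separate top-level paths:
// Illustrative only: each data set lives under its own top-level path.
var root = new Firebase('https://your-app.firebaseio.com');

function saveChatMessage(uid, messageId, message) {
  root.child('user_chat_messages').child(uid).child(messageId).set(message);
  root.child('user_audit_records').child(uid).push({
    action: 'chat_message_sent',
    messageId: messageId,
    at: Firebase.ServerValue.TIMESTAMP
  });
}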
Avoid 'value' on large data sets
Use child_added, since value must load the entire record set to the client.
Watch for hidden value operations on children
When you call child_added, you are essentially calling value on every child record. So if those children contain large lists, they are going to have to load all of that data to return it. Hence the DENORMALIZE section above.
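A small sketch of the difference (legacy Firebase JavaScript API; the path and the renderMessage callback are hypothetical):
// Stream records one at a time as they arrive, instead of loading the whole list.
function watchMessages(uid, renderMessage) {
  var messagesRef = new Firebase('https://your-app.firebaseio.com/user_chat_messages/' + uid);
  messagesRef.limit(100).on('child_added', function (snapshot) {
    renderMessage(snapshot.name(), snapshot.val());
  });
  // Avoid this on large data sets: 'value' pulls the entire list to the client at once.
  // messagesRef.on('value', function (snapshot) { ... });
}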
I'm developing a statistics module for my website that will help me measure conversion rates, and other interesting data.
The mechanism I use is - to store a database entry in a statistics table - each time a user enters a specific zone in my DB (I avoid duplicate records with the help of cookies).
For example, I have the following zones:
Website - a general zone used to count unique users, as I've stopped trusting Google Analytics lately.
Category - self descriptive.
Minisite - self descriptive.
Product Image - whenever a user sees a product and the lead submission form.
The problem is that after a month, my statistics table is packed with a lot of rows, and the ASP.NET pages I wrote to parse the data load really slowly.
I thought about writing a service that would somehow pre-parse the data, but I can't see any way to do that without losing flexibility.
My questions:
How do large-scale data-parsing applications like Google Analytics load the data so fast?
What is the best way for me to do it?
Maybe my DB design is wrong and I should store the data in only one table?
Thanks to anyone that helps,
Eytan.
The basic approach you're looking for is called aggregation.
You are interested in certain functions calculated over your data, and instead of computing them "online" when the reporting page is loaded, you calculate them offline, either via a nightly batch process or incrementally as each log record is written.
A simple enhancement would be to store counts per user/session instead of storing every hit and counting them. That would reduce your analytics processing requirements by a factor on the order of the hits per session. Of course, it would increase processing costs when inserting log entries.
Another kind of aggregation is online analytical processing (OLAP), which aggregates only along some dimensions of your data and lets users aggregate the remaining dimensions interactively while browsing. This trades off performance, storage, and flexibility.
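As a tiny illustration of the first idea (the field names are assumptions, not the poster's schema), raw hits can be rolled up offline into per-zone, per-day counts that the reporting pages read instead of scanning every hit:
// Roll raw hit records up into one count per (zone, day); the result is what the
// reporting pages would query, e.g. a small "daily_zone_counts" summary table.
function aggregateHits(hits) {
  const totals = {};
  for (const hit of hits) {
    const day = hit.timestamp.slice(0, 10);      // e.g. "2009-07-01"
    const key = hit.zone + '|' + day;            // zone = Website, Category, Minisite, ...
    totals[key] = (totals[key] || 0) + 1;
  }
  return Object.keys(totals).map(function (key) {
    const parts = key.split('|');
    return { zone: parts[0], day: parts[1], count: totals[key] };
  });
}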
It seems like you could do well by using two databases. One is for transactional data and it handles all of the INSERT statements. The other is for reporting and handles all of your query requests.
You can index the snot out of the reporting database, and/or denormalize the data so fewer joins are needed in the queries. Periodically export data from the transaction database to the reporting database; this, along with the aggregation ideas mentioned earlier, will improve the reporting response time.
Another trick to know is partitioning. Look up how that's done in the database of your choice - but basically the idea is that you tell your database to keep a table partitioned into several subtables, each with an identical definition, based on some value.
In your case, "range partitioning" is very useful: choosing the partition based on the range a value falls into. If you partition by date range, you can create separate sub-tables for each week (or each day, or each month, depending on how you use your data and how much of it there is).
This means that if you specify a date range when you issue a query, the data outside that range will not even be considered; that can lead to very significant time savings, even better than an index (an index still has to consider every row, so it grows with your data, whereas a date-bounded query only touches the partitions for that range, e.g. one per day).
This makes both online queries (ones issued when you hit your ASP page), and the aggregation queries you use to pre-calculate necessary statistics, much faster.
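As one concrete example of range partitioning (MySQL syntax, shown only for illustration since the poster's database may differ; the table and column names are made up), run from Node with the mysql package:
// Illustrative only: creates a hit table partitioned by month, so date-bounded
// queries only touch the relevant partitions (partition pruning).
const mysql = require('mysql');
const connection = mysql.createConnection({
  host: 'localhost', user: 'stats', password: 'secret', database: 'statsdb',
});

const createPartitionedTable = `
  CREATE TABLE page_hits (
    zone VARCHAR(32) NOT NULL,
    hit_date DATE NOT NULL,
    user_token CHAR(36) NOT NULL
  )
  PARTITION BY RANGE (TO_DAYS(hit_date)) (
    PARTITION p2009_06 VALUES LESS THAN (TO_DAYS('2009-07-01')),
    PARTITION p2009_07 VALUES LESS THAN (TO_DAYS('2009-08-01')),
    PARTITION pmax     VALUES LESS THAN MAXVALUE
  )`;

// Only the partitions covering this date range are scanned.
const monthlyUniques = `
  SELECT zone, COUNT(DISTINCT user_token) AS uniques
  FROM page_hits
  WHERE hit_date >= '2009-07-01' AND hit_date < '2009-08-01'
  GROUP BY zone`;

connection.query(createPartitionedTable, function (err) {
  if (err) throw err;
  connection.query(monthlyUniques, function (err, rows) {
    if (err) throw err;
    console.log(rows);
    connection.end();
  });
});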