I have been using Postgres to store time-series sensor data but I am weighing the cost of using Firestore cause I prefer the serverless nature of Firestore. My only concern is the cost of Firestore because I am paying for every read. I want to be able to display this sensor information on my web app. Now, I am taking data every 10 seconds and theres over 400+ sensor points (400 columns per row in my postgres table)
Currently, if a user queries for a week's work of data that's about 60,000 rows of data, but I optimise it by just taking every nth value to "feather" the data. So by taking every 20th row for example, I have reduced the return of the data to 3000 rows which is manageable and still the chart shows a clear trend.
I want to be able to do this in Firestore to save costs, because if a user queries for a week's data, I am paying for 60000 document reads which I can't display all those data points on the web app anyway. I have tried searching for ways to query firestore to take the Nth row of data, but haven't found any concrete solutions.
Does anybody have any recommendation how I can optimise my Firestore costs for time series data or perhaps any other cheap serverless methods to manage this data?
Firestore doesn't offer any way to "feather" data from queries, as you say. What you could do instead is put an integer in each document that describes its "Nth" value, then query for only those "N" that you want.
Related
I have a configured google analytics raw data export to big query.
Could anyone from the community suggest efficient ways to query the intraday data as we noticed the problem for Intraday Sync (e.g. 15 minutes delay), the streaming data is growing exponentially across the sync frequency.
For example:
Everyday (T-1) batch data (ga_sessions_yyymmdd) syncs with 15-20GB with 3.5M-5M records.
On the other side, the intraday data streams (with 15 min delay) more than ~150GB per day with ~30M records.
https://issuetracker.google.com/issues/117064598
It's not cost-effective for persisting & querying the data.
And, is this a product bug or expected behavior as the data is not cost-effectively usable for exponentially growing data?
Querying big query cost $5 per TB & streaming inserts cost ~$50 per TB
In my vision, it is not a bug, it is a consequence of how data is structured in Google Analytics.
Each row is a session, and inside each session you have a number of hits. As we can't afford to wait until a session is completely finished, everytime a new hit (or group of hits) occurs the whole session needs to be exported again to BQ. Updating the row is not an option in a streaming system (at least in BigQuery).
I have already created some stream pipelines in Google Dataflow with Session Windows (not sure if it is what Google uses internally), and I faced the same dilemma: wait to export the aggregate only once, or export continuously and have the exponential growth.
An advice that I can give you about querying the ga_realtime_sessions table is:
Only query for the columns you really need (no select *);
use the view that is exported in conjunction with the daily ga_realtime_sessions_yyyymmdd, it doesn't affect the size of the query, but it will prevent you from using duplicated data.
I am wondering if anyone knows a good way to store time series data of different time resolutions in DynamoDB.
For example, I have devices that send data to DynamoDB every 30 seconds. The individual readings are stored in a Table with the unique device ID as the Hash Key and a timestamp as the Range Key.
I want to aggregate this data over various time steps (30 mins, 1 hr, 1 day etc.) using a lambda and store the aggregates in DynamoDB as well. I then want to be able to grab any resolution data for any particular range of time, 48 30 minute aggregates for the last 24hrs for instance, or each daily aggregate for this month last year.
I am unsure if each new resolution should have its own tables, data_30min, data_1hr etc or if a better approach would be something like making a composite Hash Key by combining the resolution with the Device ID and storing all aggregate data in a single table.
For instance if the device ID is abc123 all 30 minute data could be stored with the Hash Key abc123_30m and the 1hr data could be stored with the HK abc123_1h and each would still use a timestamp as the range key.
What are some pros and cons to each of these approaches and is there a solution I am not thinking of which would be useful in this situation?
Thanks in advance.
I'm not sure if you've seen this page from the tech docs regarding Best Practices for storing time series data in DynamoDB. It talks about splitting your data into time periods such that you only have one "hot" table where you're writing and many "cold" tables that you only read from.
Regarding the primary/sort key selection, you should probably use a coarse timestamp value as the primary key and the actual timestamp as a sort key. Otherwise, if your periods are coarse enough, or each device only produces a relatively small amount of data then your idea of using the device id as the hash key could work as well.
Generating pre-aggregates and storing in DynamoDb would certainly work though you should definitely consider having separate tables for the different granularities you want to support. Beware of mutating data. As long as all your data arrives in order and you don't need to recompute old data, then storing pre-aggregated time series is fine but if data can mutate, or if you have to account for out-of order/late arriving data then things get complicated.
You may also consider a relational database for the "hot" data (ie. last 7 days, or whatever period makes sense) and then, running a batch process to pre-aggregate and move the data into cold, read-only DynamoDB tables, with DAX etc.
Is there any good documentation on how query times change for a DynamoDB table based on equal read capacity and differing row sizes? I've been reading through the documentation and can't find anything, was wondering if anybody has done any studies into this?
My use case is that I'm putting a million rows into a table a week. These records are referenced quite a bit as they're entered but as time goes on the frequency at which I query those rows decreases. Can I leave those records in the table indefinitely with no detrimental effect on query time, or should I rotate them out so the newer data that is requested more frequently returns faster?
Please don't keep the old data indefinitely. It is advised to archive the data for better performance.
Few points on design and testing:-
Designing the proper hash key, so that the data is distributed
access the partitions
Understand Access Patterns for Time Series Data
Test your application at scale to avoid problems with "hot" keys
when your table becomes larger
Suppose you design a table to track customer behavior on your site,
such as URLs that they click. You might design the table with a
composite primary key consisting of Customer ID as the partition key
and date/time as the sort key. In this application, customer data
grows indefinitely over time; however, the applications might show
uneven access pattern across all the items in the table where the
latest customer data is more relevant and your application might
access the latest items more frequently and as time passes these items
are less accessed, eventually the older items are rarely accessed. If
this is a known access pattern, you could take it into consideration
when designing your table schema. Instead of storing all items in a
single table, you could use multiple tables to store these items. For
example, you could create tables to store monthly or weekly data. For
the table storing data from the latest month or week, where data
access rate is high, request higher throughput and for tables storing
older data, you could dial down the throughput and save on resources.
Time Series Data Access Pattern
Guidelines for table partitions
I am writing an app that will store regular temperature readings, and am looking to use Apigee App Services for the storage. However, to chart the temperature readings over time, it is inefficient to pull all the readings out over a period (e.g. a month) because there would be too many (there's one every 15 seconds or so), especially when the most common case would be to show a trend. The app could support (a) retrieving only every nth sample (for appropriate choice of n depending on the graph), (b) retrieving the average (or min, or max) of groups of n samples over the period, or (c) retrieving n, evenly spaced samples, over the period. However, it doesn't look like Apigee would support any of these using their data retrieval APIs.
I would've thought that retrieving time-series data in such a fashion is not an usual use-case, so hopefully someone's already tackled this. Is it possible?
One way you may accomplish this is by having a field (called sample_bin) that is assigned a value RANDOM(0-n) when you save it. Then, when you query the data, add in the condition that sample_bin = a specific number 0-n. This would save you from retrieving all of the records from the database to do the sampling. This should result in a more or less evenly distributed random sampling.
I'm developing a statistics module for my website that will help me measure conversion rates, and other interesting data.
The mechanism I use is - to store a database entry in a statistics table - each time a user enters a specific zone in my DB (I avoid duplicate records with the help of cookies).
For example, I have the following zones:
Website - a general zone used to count unique users as I stopped trusting Google Analytics lately.
Category - self descriptive.
Minisite - self descriptive.
Product Image - whenever user sees a product and the lead submission form.
Problem is after a month, my statistics table is packed with a lot of rows, and the ASP.NET pages I wrote to parse the data load really slow.
I thought maybe writing a service that will somehow parse the data, but I can't see any way to do that without losing flexibility.
My questions:
How large scale data parsing applications - like Google Analytics load the data so fast?
What is the best way for me to do it?
Maybe my DB design is wrong and I should store the data in only one table?
Thanks for anyone that helps,
Eytan.
The basic approach you're looking for is called aggregation.
You are interested in certain function calculated over your data and instead of calculating the data "online" when starting up the displaying website, you calculate them offline, either via a batch process in the night or incrementally when the log record is written.
A simple enhancement would be to store counts per user/session, instead of storing every hit and counting them. That would reduce your analytic processing requirements by a factor in the order of the hits per session. Of course it would increase processing costs when inserting log entries.
Another kind of aggregation is called online analytical processing, which only aggregates along some dimensions of your data and lets users aggregate the other dimensions in a browsing mode. This trades off performance, storage and flexibility.
It seems like you could do well by using two databases. One is for transactional data and it handles all of the INSERT statements. The other is for reporting and handles all of your query requests.
You can index the snot out of the reporting database, and/or denormalize the data so fewer joins are used in the queries. Periodically export data from the transaction database to the reporting database. This act will improve the reporting response time along with the aggregation ideas mentioned earlier.
Another trick to know is partitioning. Look up how that's done in the database of your choice - but basically the idea is that you tell your database to keep a table partitioned into several subtables, each with an identical definition, based on some value.
In your case, what is very useful is "range partitioning" -- choosing the partition based on a range into which a value falls into. If you partition by date range, you can create separate sub-tables for each week (or each day, or each month -- depends on how you use your data and how much of it there is).
This means that if you specify a date range when you issue a query, the data that is outside that range will not even be considered; that can lead to very significant time savings, even better than an index (an index has to consider every row, so it will grow with your data; a partition is one per day).
This makes both online queries (ones issued when you hit your ASP page), and the aggregation queries you use to pre-calculate necessary statistics, much faster.