How to store and process time-series data in Apigee App Services?

I am writing an app that will store regular temperature readings, and I am looking to use Apigee App Services for the storage. However, to chart the temperature readings over time, it is inefficient to pull out all the readings for a period (e.g. a month), because there would be too many (one every 15 seconds or so), especially when the most common case is showing a trend. The app could work with (a) only every nth sample (for an appropriate choice of n depending on the graph), (b) the average (or min, or max) of groups of n samples over the period, or (c) n evenly spaced samples over the period. It doesn't look like Apigee supports any of these through its data retrieval APIs.
I would have thought that retrieving time-series data in this fashion is not an unusual use case, so hopefully someone has already tackled this. Is it possible?
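For clarity, option (b) is just bucketed averaging; here is a minimal client-side sketch in Python of what I would like the API to do for me (the data shape is illustrative):

```python
from statistics import mean

def bucket_average(samples, n):
    """Option (b): reduce (timestamp, value) samples by averaging each group of n readings."""
    reduced = []
    for i in range(0, len(samples), n):
        chunk = samples[i:i + n]
        # Keep the first timestamp of the bucket and the mean of its values.
        reduced.append((chunk[0][0], mean(value for _, value in chunk)))
    return reduced

# e.g. collapse ~4 readings/minute into one point per hour:
# hourly = bucket_average(raw_samples, 240)
```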

One way to accomplish this is to add a field (say, sample_bin) that is assigned a random value between 0 and n when you save each reading. Then, when you query the data, add the condition that sample_bin equals a specific number in that range. This saves you from retrieving all of the records from the database just to sample them, and should give you a more or less evenly distributed random sample.
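A rough sketch of how that could look against the App Services REST API, in Python; the org/app/collection names and the exact query syntax are assumptions to adapt to your setup:

```python
import random
import requests

BASE = "https://api.usergrid.com/my-org/my-app"  # hypothetical org/app path
N_BINS = 50                                      # each bin holds ~1/50th of the readings

def save_reading(token, temperature, ts):
    """Store a reading with a randomly assigned sample_bin."""
    entity = {"temperature": temperature, "ts": ts,
              "sample_bin": random.randrange(N_BINS)}
    requests.post(f"{BASE}/readings", json=entity,
                  headers={"Authorization": f"Bearer {token}"}, timeout=10)

def sample_readings(token, start_ts, end_ts, bin_no=0):
    """Fetch roughly 1/N_BINS of the readings in a time range."""
    ql = f"select * where sample_bin = {bin_no} and ts >= {start_ts} and ts <= {end_ts}"
    resp = requests.get(f"{BASE}/readings",
                        params={"ql": ql, "limit": 1000},
                        headers={"Authorization": f"Bearer {token}"}, timeout=10)
    return resp.json().get("entities", [])
```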

Related

Firebase: How to organize data that is synced to many groups

I have a problem regarding the organization of my data.
What I want to achieve
TL;DR: One data point updated in real time in many different groups; how should I organize it?
Each user sets a daily goal (goal) he wants to achieve
While working, each user increases their time to get closer to their daily goal (daily_time_spent), say from 1 minute spent to 2 minutes spent.
Each user can also be in a group with other users.
If there is a group of users, you can see each other's progress (goal/daily_time_spent) in real time (real time being every 2-5 minutes, for cost reasons).
It will later also be possible to set a daily goal for a specific group. Your own daily goal would contribute to each of the groups.
Say you are part of three groups with goals of 10m/20m/30m and you have already done 10m: you would have completed the first group's goal, 50% of the second, and about 33% of the third. Your own progress (daily_time_spent) contributes to all groups, regardless of the individual goals (group_daily_goal).
My ideas
How would I organize that? One idea: whenever a user increments their time, the new value gets written into each group the user is part of. But this seems pretty inefficient, because I would be writing the same data in many different places (coming from a SQL-developer background, it might also be expensive?).
Another option: Each user tracks his time, say under userTimes/{user} and then there are the groups: groups/{groupname} with links to userTimes. But then I don't know how to get realtime updates.
Thanks a lot for your help!
Both approaches can work fine, and there is no single best approach here - as Alex said, it all depends on the use cases of your app and your comfort level with the code that each of them requires.
Duplicating the data under each relevant group will complicate the code that performs the write operation, and it stores more data. But in return, reading the data becomes really simple and scales very well to many users.
Keeping a single copy under each user and reading from all of a group's members will complicate the code that performs the read operation and slow it down a bit (though not nearly as much as you may expect, as Firebase can pipeline the requests). But it keeps your data minimal and your write operations simple.
If you choose to duplicate the data, that is an operation that you can usually do well in a (RTDB-triggered) Cloud Function, but it's also possible to do it through a multi-path write operation from the client.
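For illustration, a minimal fan-out sketch with the Admin SDK in Python; the paths userTimes/{uid} and groups/{gid}/members/{uid} are assumptions based on the structure described in the question:

```python
import firebase_admin
from firebase_admin import credentials, db

# Assumes a service-account key file and your project's database URL.
cred = credentials.Certificate("service-account.json")
firebase_admin.initialize_app(
    cred, {"databaseURL": "https://example-project-default-rtdb.firebaseio.com"}
)

def fan_out_daily_time(uid, group_ids, daily_time_spent):
    """Write the user's progress to their own node and to every group in one atomic multi-path update."""
    updates = {f"userTimes/{uid}/daily_time_spent": daily_time_spent}
    for gid in group_ids:
        updates[f"groups/{gid}/members/{uid}/daily_time_spent"] = daily_time_spent
    db.reference("/").update(updates)

# fan_out_daily_time("user123", ["group_a", "group_b", "group_c"], 10)
```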

Optimising Firestore costs for time series data?

I have been using Postgres to store time-series sensor data, but I am weighing the cost of using Firestore because I prefer its serverless nature. My only concern is the cost of Firestore, because I am paying for every read. I want to be able to display this sensor information on my web app. I am taking data every 10 seconds and there are over 400 sensor points (400 columns per row in my Postgres table).
Currently, if a user queries for a week's worth of data, that is about 60,000 rows, but I optimise it by taking only every nth value to "feather" the data. By taking every 20th row, for example, I reduce the result to 3,000 rows, which is manageable, and the chart still shows a clear trend.
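For reference, the feathering in Postgres is roughly a windowed row_number query like this (table and column names simplified):

```python
import psycopg2  # or any DB-API driver

FEATHER_SQL = """
    SELECT ts, value
    FROM (
        SELECT ts, value,
               row_number() OVER (ORDER BY ts) AS rn
        FROM sensor_readings                     -- simplified table name
        WHERE ts BETWEEN %(start)s AND %(end)s
    ) numbered
    WHERE rn %% %(n)s = 0                        -- keep every nth row
    ORDER BY ts;
"""

def feathered_range(conn, start, end, n=20):
    """Return roughly every nth row of the requested time range."""
    with conn.cursor() as cur:
        cur.execute(FEATHER_SQL, {"start": start, "end": end, "n": n})
        return cur.fetchall()
```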
I want to be able to do this in Firestore to save costs, because if a user queries for a week's data I would be paying for 60,000 document reads, even though I can't display all of those data points on the web app anyway. I have tried searching for ways to query Firestore for every Nth row of data, but haven't found any concrete solutions.
Does anybody have any recommendation how I can optimise my Firestore costs for time series data or perhaps any other cheap serverless methods to manage this data?
Firestore doesn't offer any way to "feather" data from queries, as you say. What you could do instead is put an integer in each document that describes its "Nth" value, then query for only those "N" that you want.
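A minimal sketch of that idea with the Python client; the field and collection names are illustrative, and note that combining an equality filter with a range filter will likely require a composite index:

```python
from google.cloud import firestore

db = firestore.Client()

def save_reading(counter, value, ts):
    """Tag each reading with its position in a fixed cycle of 20."""
    db.collection("readings").add({
        "ts": ts,
        "value": value,
        "sample_mod": counter % 20,   # 0..19, repeats every 20 readings
    })

def feathered_readings(start, end):
    """Read roughly every 20th document: only those whose sample_mod is 0."""
    query = (
        db.collection("readings")
        .where("sample_mod", "==", 0)
        .where("ts", ">=", start)
        .where("ts", "<=", end)
        .order_by("ts")
    )
    return [doc.to_dict() for doc in query.stream()]
```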

Storing Time-Series Data of different resolution in DynamoDB

I am wondering if anyone knows a good way to store time series data of different time resolutions in DynamoDB.
For example, I have devices that send data to DynamoDB every 30 seconds. The individual readings are stored in a Table with the unique device ID as the Hash Key and a timestamp as the Range Key.
I want to aggregate this data over various time steps (30 min, 1 hr, 1 day, etc.) using a Lambda and store the aggregates in DynamoDB as well. I then want to be able to grab data at any resolution for any particular range of time: the 48 thirty-minute aggregates for the last 24 hours, for instance, or each daily aggregate for this month last year.
I am unsure whether each resolution should have its own table (data_30min, data_1hr, etc.), or whether a better approach would be something like making a composite hash key by combining the resolution with the device ID and storing all aggregate data in a single table.
For instance, if the device ID is abc123, all 30-minute data could be stored with the hash key abc123_30m and the 1-hour data with the hash key abc123_1h, and each would still use a timestamp as the range key.
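A sketch of how the single-table variant would be queried with boto3 (table and attribute names are hypothetical):

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("device_aggregates")   # hypothetical single aggregates table

def get_aggregates(device_id, resolution, start_ts, end_ts):
    """e.g. get_aggregates('abc123', '30m', now - 86400, now)
    returns the 48 half-hour aggregates for the last 24 hours."""
    resp = table.query(
        KeyConditionExpression=Key("device_res").eq(f"{device_id}_{resolution}")
        & Key("ts").between(start_ts, end_ts)
    )
    return resp["Items"]
```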
What are some pros and cons to each of these approaches and is there a solution I am not thinking of which would be useful in this situation?
Thanks in advance.
I'm not sure if you've seen this page from the tech docs regarding Best Practices for storing time series data in DynamoDB. It talks about splitting your data into time periods such that you only have one "hot" table where you're writing and many "cold" tables that you only read from.
Regarding the primary/sort key selection, you should probably use a coarse timestamp value as the primary key and the actual timestamp as the sort key. Alternatively, if your periods are coarse enough, or each device only produces a relatively small amount of data, your idea of using the device ID as the hash key could work as well.
Generating pre-aggregates and storing them in DynamoDB would certainly work, though you should definitely consider having separate tables for the different granularities you want to support. Beware of mutating data: as long as all your data arrives in order and you don't need to recompute old data, storing pre-aggregated time series is fine, but if data can mutate, or if you have to account for out-of-order or late-arriving data, things get complicated.
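As a sketch of the pre-aggregation step (whether it runs in a Lambda or elsewhere), assuming a raw readings table keyed by device_id/ts and a separate 30-minute table; all names are hypothetical:

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
raw = dynamodb.Table("device_readings")         # 30-second raw readings
agg_30m = dynamodb.Table("device_agg_30m")      # one table per granularity

def aggregate_window(device_id, window_start, window_end):
    """Roll the raw readings in [window_start, window_end) up into one 30-minute item."""
    resp = raw.query(
        KeyConditionExpression=Key("device_id").eq(device_id)
        & Key("ts").between(window_start, window_end - 1)
    )
    values = [item["value"] for item in resp["Items"]]
    if not values:
        return
    agg_30m.put_item(Item={
        "device_id": device_id,
        "ts": window_start,                     # window start as the range key
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": sum(values) / len(values),       # values come back as Decimal
    })
```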
You may also consider a relational database for the "hot" data (i.e. the last 7 days, or whatever period makes sense) and then run a batch process to pre-aggregate and move the data into cold, read-only DynamoDB tables, with DAX etc.

Local Cube - Is there a reason to use OLTP's grain?

I am building a local OLAP cube based on data gathered from several OLTP sources. Please note that I am doing this programmatically and do not have access to tools like SSAS or MDX-based tools.
My requirements are somewhat different than the operational requirements of the OLTP system users. I know that "in theory" it would be preferable to retain the most atomic grain available to me, but I don't see a reason to include the lowest level of data in the cube.
For example (I am simplifying), I have a measure field like "Price". Additionally, each sales fact has a Version attribute with values such as:
List (Original/Initial)
Initial Quote
Adjusted Quote
Sold
These describe the internal development of our pricing and are critical to the reports that I create.
However, for my reporting purposes, I will always want to know the value of all Versions whenever I am referencing a given transaction. Therefore, I am considering pivoting measures like Price by Version in the cube (Version will still be its own entity in the data model), resulting in measures like:
PriceList
PriceQuotedInitial
PriceQuotedAdjusted
PriceSold
Since only one Version is ever effective at a given point in time, we do not need to aggregate across multiple Versions.
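For concreteness, the pivot I have in mind looks like this (a small pandas sketch with illustrative column names and values):

```python
import pandas as pd

# One source row per (transaction, Version); values are illustrative.
facts = pd.DataFrame({
    "transaction_id": [1, 1, 1, 1, 2, 2],
    "version": ["List", "QuotedInitial", "QuotedAdjusted", "Sold", "List", "Sold"],
    "price": [100.0, 95.0, 92.5, 92.5, 80.0, 78.0],
})

# Pivot Price by Version: one row per transaction, one measure column per Version.
wide = (
    facts.pivot(index="transaction_id", columns="version", values="price")
         .add_prefix("Price")          # PriceList, PriceQuotedInitial, ...
         .reset_index()
)
print(wide)
```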
Known Advantages
Since this will be a local cube file, it appears this approach would simplify the creation of several required calculated measures that compare Price across different Versions (it would not be an issue to create calculated measures at various levels of aggregation if I were doing this with MDX).
It would also reduce the number of records by a factor of between 3 and 6, which would significantly boost performance for a local cube.
Known Disadvantages
While the data model will match the business process, the cube would not store the data at the most atomic level. An analyst would need to distinguish between Versions by Measure selection, and could not filter by Version - they would always get all available Versions.
This approach will greatly increase the number of Measures. For example, there is not just one Price we are tracking, but several price components and other Measures we track for each transaction. So if we track a dozen true Measures for each transaction, that might end up being 50-60 Measures if I take this approach.
I understand that for very large Fact tables, it would be preferable to factor all possible fields out of the Fact table into Dimensions for performance purposes, but I am not sure whether this is the case when using a local cube, as in all likelihood, I will put fewer than 50,000 records into any given cube file, given the limitations of local cubes.
Are there other drawbacks to this approach that I'm missing?

How to handle large amounts of data for a web statistics module

I'm developing a statistics module for my website that will help me measure conversion rates, and other interesting data.
The mechanism I use is to store an entry in a statistics table in my DB each time a user enters a specific zone (I avoid duplicate records with the help of cookies).
For example, I have the following zones:
Website - a general zone used to count unique users as I stopped trusting Google Analytics lately.
Category - self descriptive.
Minisite - self descriptive.
Product Image - whenever a user sees a product and the lead submission form.
The problem is that after a month, my statistics table is packed with a lot of rows, and the ASP.NET pages I wrote to parse the data load really slowly.
I thought about writing a service that would somehow pre-process the data, but I can't see any way to do that without losing flexibility.
My questions:
How do large-scale data-parsing applications like Google Analytics load the data so fast?
What is the best way for me to do it?
Maybe my DB design is wrong and I should store the data in only one table?
Thanks to anyone who helps,
Eytan.
The basic approach you're looking for is called aggregation.
You are interested in certain functions calculated over your data, and instead of computing them "online" when the reporting page is displayed, you calculate them offline, either via a nightly batch process or incrementally whenever a log record is written.
A simple enhancement would be to store counts per user/session instead of storing every hit and counting them. That would reduce your analytic processing requirements by a factor on the order of the number of hits per session. Of course, it would increase processing costs when inserting log entries.
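A minimal sketch of the offline aggregation, assuming a SQL Server backend behind the ASP.NET pages (table and column names are hypothetical):

```python
import pyodbc

AGGREGATE_SQL = """
    INSERT INTO zone_daily_stats (zone_id, stat_date, hit_count, unique_users)
    SELECT zone_id,
           CAST(created_at AS date),
           COUNT(*),
           COUNT(DISTINCT user_cookie)
    FROM statistics_hits
    WHERE created_at >= ? AND created_at < ?
    GROUP BY zone_id, CAST(created_at AS date);
"""

def aggregate_day(conn, day_start, day_end):
    """Pre-compute per-zone counts for one day so reports never scan the raw hit table."""
    cur = conn.cursor()
    cur.execute(AGGREGATE_SQL, day_start, day_end)
    conn.commit()
    cur.close()
```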
Another kind of aggregation is called online analytical processing, which only aggregates along some dimensions of your data and lets users aggregate the other dimensions in a browsing mode. This trades off performance, storage and flexibility.
It seems like you could do well by using two databases. One is for transactional data and it handles all of the INSERT statements. The other is for reporting and handles all of your query requests.
You can index the snot out of the reporting database, and/or denormalize the data so fewer joins are used in the queries. Periodically export data from the transaction database to the reporting database. This act will improve the reporting response time along with the aggregation ideas mentioned earlier.
Another trick to know is partitioning. Look up how that's done in the database of your choice - but basically the idea is that you tell your database to keep a table partitioned into several subtables, each with an identical definition, based on some value.
In your case, what is very useful is "range partitioning": choosing the partition based on the range a value falls into. If you partition by date range, you can create separate sub-tables for each week (or each day, or each month, depending on how you use your data and how much of it there is).
This means that if you specify a date range when you issue a query, the data outside that range is not even considered; that can lead to very significant time savings, even better than an index (an index still has to cover every row, so it grows with your data, whereas partitions outside the range are skipped entirely).
This makes both online queries (ones issued when you hit your ASP page), and the aggregation queries you use to pre-calculate necessary statistics, much faster.
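As an illustration of date-range partitioning, here is PostgreSQL's declarative syntax (SQL Server and MySQL have their own equivalents); the table, columns, and week boundaries are hypothetical:

```python
import psycopg2

PARTITION_DDL = """
    CREATE TABLE statistics_hits (
        zone_id     integer     NOT NULL,
        user_cookie text        NOT NULL,
        created_at  timestamptz NOT NULL
    ) PARTITION BY RANGE (created_at);

    -- One sub-table per week; a query constrained to a date range only touches
    -- the partitions that overlap that range.
    CREATE TABLE statistics_hits_w01 PARTITION OF statistics_hits
        FOR VALUES FROM ('2024-01-01') TO ('2024-01-08');
    CREATE TABLE statistics_hits_w02 PARTITION OF statistics_hits
        FOR VALUES FROM ('2024-01-08') TO ('2024-01-15');
"""

def create_partitioned_tables(conn):
    with conn.cursor() as cur:
        cur.execute(PARTITION_DDL)
    conn.commit()
```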
