Generating IoT analytics using Firestore / Firebase

I have around 25 devices located in different parts of my city. Each device generates temperature and humidity data as a write in the Firestore database. Every hour I want to average the temp and humidity info and store/append it as a map in a single document.
{ DateTime: { temp: x, humidity: y }}
{ DateTime: { temp: x1, humidity: y1 }}
I'm using python to get the data and average it and then write it in the document.
Then I want to load this data in the front end to show the user analytics based on hours, days, weeks, and months. But Firestore cannot store more than 5 MB in a single document so I will run out of space fast. Is there any better way of generating analytics from stored data in collections?
Most of the people I talk to recommend switching away from firebase for such use cases, but I do believe there is a way of doing this that is generally a good practice.

Firestore cannot store more than 5 MB in a single document
According to the docs, the maximum size is actually 1 MiB, but that is more than enough for data like yours.
Since you cannot store the data for all devices in a single document, I recommend storing only the data that corresponds to a single day in each document. If you're only storing numbers, the data for a week, or even a month, might fit in one document; you'll have to measure that.
This way there is no real limitation: you can create as many documents as you need, and of course you won't run "out of space".
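For illustration, here is a minimal sketch of that hourly roll-up in Python with the google-cloud-firestore client. The readings and hourly_averages collection names, the ts/temp/humidity field names, and the per-day document ID are assumptions for the sketch, not something taken from the question.

from datetime import datetime, timedelta, timezone
from google.cloud import firestore

db = firestore.Client()

def rollup_last_hour():
    """Average the last hour of readings and append them to a per-day document."""
    end = datetime.now(timezone.utc).replace(minute=0, second=0, microsecond=0)
    start = end - timedelta(hours=1)

    # Assumed schema: each device writes {temp, humidity, ts} to a 'readings' collection.
    docs = (db.collection("readings")
              .where("ts", ">=", start)
              .where("ts", "<", end)
              .stream())

    temps, hums = [], []
    for doc in docs:
        data = doc.to_dict()
        temps.append(data["temp"])
        hums.append(data["humidity"])
    if not temps:
        return

    # One document per day keeps each document comfortably under the 1 MiB limit.
    day_doc = db.collection("hourly_averages").document(start.strftime("%Y-%m-%d"))
    day_doc.set({
        start.strftime("%H:00"): {
            "temp": sum(temps) / len(temps),
            "humidity": sum(hums) / len(hums),
        },
    }, merge=True)  # merge=True appends this hour without overwriting earlier hours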

Related

Optimising Firestore costs for time series data?

I have been using Postgres to store time-series sensor data, but I am weighing the cost of switching to Firestore because I prefer its serverless nature. My only concern is cost, since I pay for every read. I want to display this sensor information on my web app. I am taking data every 10 seconds, and there are over 400 sensor points (400 columns per row in my Postgres table).
Currently, if a user queries for a week's worth of data, that's about 60,000 rows, but I optimise it by taking only every nth value to "feather" the data. By taking every 20th row, for example, I reduce the returned data to 3,000 rows, which is manageable, and the chart still shows a clear trend.
I want to do the same in Firestore to save costs, because if a user queries a week's data I am paying for 60,000 document reads, even though I can't display all those data points on the web app anyway. I have searched for ways to query Firestore for every Nth row of data, but haven't found any concrete solutions.
Does anybody have any recommendations for how I can optimise my Firestore costs for time-series data, or perhaps any other cheap serverless methods to manage it?
Firestore doesn't offer any way to "feather" data from queries, as you say. What you could do instead is put an integer in each document that describes its "Nth" value, then query for only those "N" that you want.
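A minimal sketch of that approach in Python, assuming a hypothetical sample_bucket field written alongside each reading (a composite index on ts and sample_bucket would be needed for the combined query):

from google.cloud import firestore

db = firestore.Client()
FEATHER_FACTOR = 20  # keep roughly 1 in 20 samples when charting

def write_reading(counter: int, payload: dict):
    # Tag each document with its position modulo the feathering factor.
    payload["sample_bucket"] = counter % FEATHER_FACTOR
    db.collection("sensor_readings").add(payload)

def read_feathered(start, end):
    # Reading only sample_bucket == 0 returns ~1/20th of the documents (and reads).
    return (db.collection("sensor_readings")
              .where("ts", ">=", start)
              .where("ts", "<", end)
              .where("sample_bucket", "==", 0)
              .stream())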

How to write ~500 K documents every day efficiently in Firestore?

I have a Cloud Function in Python 3.7 to write/update small documents to Firestore. Each document has a user_id as its document ID, and two fields: a timestamp and a map (a dictionary) with three key-value pairs, all of them very small.
This is the code I'm using to write/update Firestore:
doc_ref = db.collection(u'my_collection').document(user['user_id'])
# Convert the date to a datetime at midnight before storing it
date_last_seen = datetime.combine(date_last_seen, datetime.min.time())
doc_ref.set({u'map_field': map_value, u'date_last_seen': date_last_seen})
My goal is to call this function once every day and write/update ~500K documents. I tried the following tests; for each one I include the execution time:
Test A: Process the output to 1000 documents. Don't write/update Firestore -> ~ 2 seconds
Test B: Process the output to 1000 documents. Write/update Firestore -> ~ 1 min 3 seconds
Test C: Process the output to 5000 documents. Don't write/update Firestore -> ~ 3 seconds
Test D: Process the output to 5000 documents. Write/update Firestore -> ~ 3 min 12 seconds
My conclusion here: writing/updating Firestore is consuming more than 99% of my compute time.
Question: How to write/update ~500 K documents every day efficiently?
It's not possible to prescribe a single course of action without knowing details about the data you're actually trying to write. I strongly suggest you read the documentation about best practices for Firestore. It will give you a sense of what things you can do to avoid problems with heavy write loads.
Basically, you will want to avoid these situations, as described in that doc:
High read, write, and delete rates to a narrow document range
Avoid high read or write rates to lexicographically close documents, or your application will experience contention errors. This issue is known as hotspotting, and your application can experience hotspotting if it does any of the following:
Creates new documents at a very high rate and allocates its own monotonically increasing IDs.
Cloud Firestore allocates document IDs using a scatter algorithm. You should not encounter hotspotting on writes if you create new documents using automatic document IDs.
Creates new documents at a high rate in a collection with few documents.
Creates new documents with a monotonically increasing field, like a timestamp, at a very high rate.
Deletes documents in a collection at a high rate.
Writes to the database at a very high rate without gradually increasing traffic.
I won't repeat all the advice in that doc. What you do need to know is this: because Firestore is built to scale massively, limits are placed on how quickly you can write data into it. The fact that you have to ramp up traffic gradually is probably going to be the main constraint you can't work around.
I met my needs with batched writes, but according to the Firestore documentation there is an even faster way:
Note: For bulk data entry, use a server client library with parallelized individual writes. Batched writes perform better than serialized writes but not better than parallel writes. You should use a server client library for bulk data operations and not a mobile/web SDK.
I also recommend taking a look at this Stack Overflow post, which has examples in Node.js.
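For reference, here is a rough sketch of the batched approach in Python; the rows iterable and its field names are assumptions based on the question, and Firestore allows at most 500 operations per batch:

from google.cloud import firestore

db = firestore.Client()
BATCH_LIMIT = 500  # Firestore caps a single batch at 500 operations

def write_daily_updates(rows):
    # rows: iterable of (user_id, map_value, date_last_seen) tuples
    batch = db.batch()
    count = 0
    for user_id, map_value, date_last_seen in rows:
        doc_ref = db.collection("my_collection").document(user_id)
        batch.set(doc_ref, {"map_field": map_value, "date_last_seen": date_last_seen})
        count += 1
        if count == BATCH_LIMIT:
            batch.commit()
            batch, count = db.batch(), 0
    if count:
        batch.commit()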

Firebase document read optimization

I'm using Firebase Firestore for my React Native app. The app sends the user's geolocation to Firestore every 5 minutes and generates a heatmap from it. My data looks like this:
Right now I have about 1,000 documents, and every time I refresh the app it fetches all the coordinates to generate the heatmap.
The problem is that generating the heatmap reads all 1,000 documents. If I had 5,000 coords/documents and 10 users, it would hit the document read limit of the Firebase free plan, which is 50K/day.
I know I can pay to raise the read limit, but I'm wondering if anyone has run into this and found a way to optimize it. Thanks!
I don't know all the constraints of your application, but you could store all the coordinates for one month in a single document, in arrays, reducing the number of document reads by a factor of 8,928.
If I did the maths correctly, based on the storage size calculations explained at https://firebase.google.com/docs/firestore/storage-size, a doc in your coords collection with 3 arrays named lat, long and ts holding 288 × 31 triplets (288 = one reading every 5 minutes for a day) would have a maximum size of 857,088 bytes, which is under the maximum document size of 1,048,576 bytes listed at https://firebase.google.com/docs/firestore/quotas.
Of course, you'll have to deal with the array fields, but for that you can use firebase.firestore.FieldValue.arrayUnion(); see https://firebase.google.com/docs/firestore/manage-data/add-data#update_elements_in_an_array
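For consistency with the other examples, here is a rough Python sketch of the same idea using the server client (the answer links the equivalent JavaScript API); the per-month document ID is an assumption:

from datetime import datetime, timezone
from google.cloud import firestore

db = firestore.Client()

def append_location(lat: float, lng: float):
    # One document per month, e.g. coords/2024-01, holding parallel arrays.
    now = datetime.now(timezone.utc)
    month_doc = db.collection("coords").document(now.strftime("%Y-%m"))
    # ArrayUnion only appends elements that are not already present in the array.
    month_doc.set({
        "lat": firestore.ArrayUnion([lat]),
        "long": firestore.ArrayUnion([lng]),
        "ts": firestore.ArrayUnion([now.isoformat()]),
    }, merge=True)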

Storing Time-Series Data of different resolution in DynamoDB

I am wondering if anyone knows a good way to store time series data of different time resolutions in DynamoDB.
For example, I have devices that send data to DynamoDB every 30 seconds. The individual readings are stored in a Table with the unique device ID as the Hash Key and a timestamp as the Range Key.
I want to aggregate this data over various time steps (30 mins, 1 hr, 1 day, etc.) using a Lambda and store the aggregates in DynamoDB as well. I then want to be able to grab data at any resolution for any particular range of time: 48 thirty-minute aggregates for the last 24 hours, for instance, or each daily aggregate for this month last year.
I am unsure whether each new resolution should have its own table (data_30min, data_1hr, etc.) or whether a better approach would be something like making a composite hash key by combining the resolution with the device ID and storing all aggregate data in a single table.
For instance, if the device ID is abc123, all 30-minute data could be stored with the hash key abc123_30m, the 1-hour data could be stored with the hash key abc123_1h, and each would still use a timestamp as the range key.
What are some pros and cons to each of these approaches and is there a solution I am not thinking of which would be useful in this situation?
Thanks in advance.
I'm not sure if you've seen this page from the tech docs regarding Best Practices for storing time series data in DynamoDB. It talks about splitting your data into time periods such that you only have one "hot" table where you're writing and many "cold" tables that you only read from.
Regarding the primary/sort key selection, you should probably use a coarse timestamp value as the partition (hash) key and the actual timestamp as the sort key. Otherwise, if your periods are coarse enough, or each device only produces a relatively small amount of data, then your idea of using the device ID as the hash key could work as well.
Generating pre-aggregates and storing in DynamoDb would certainly work though you should definitely consider having separate tables for the different granularities you want to support. Beware of mutating data. As long as all your data arrives in order and you don't need to recompute old data, then storing pre-aggregated time series is fine but if data can mutate, or if you have to account for out-of order/late arriving data then things get complicated.
You may also consider a relational database for the "hot" data (ie. last 7 days, or whatever period makes sense) and then, running a batch process to pre-aggregate and move the data into cold, read-only DynamoDB tables, with DAX etc.
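As a concrete illustration of the composite hash key idea from the question, here is a rough boto3 sketch in Python; the table name device_aggregates and the attribute names are hypothetical:

from decimal import Decimal
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("device_aggregates")  # hypothetical table name

def put_aggregate(device_id: str, resolution: str, ts_iso: str, temp_avg: float):
    # Composite hash key such as "abc123_30m" or "abc123_1h"; timestamp as range key.
    table.put_item(Item={
        "pk": f"{device_id}_{resolution}",
        "ts": ts_iso,
        "temp_avg": Decimal(str(temp_avg)),  # boto3 requires Decimal, not float
    })

def query_range(device_id: str, resolution: str, start_iso: str, end_iso: str):
    # Fetch all aggregates of one resolution for a given time range.
    resp = table.query(
        KeyConditionExpression=Key("pk").eq(f"{device_id}_{resolution}")
        & Key("ts").between(start_iso, end_iso)
    )
    return resp["Items"]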

How to store and process time-series data in Apigee App Services?

I am writing an app that will store regular temperature readings, and I am looking at using Apigee App Services for the storage. However, to chart the temperature readings over time, it is inefficient to pull all the readings for a period (e.g. a month), because there would be too many (there's one every 15 seconds or so), especially when the most common case is just showing a trend. The app could work with (a) retrieving only every nth sample (for an appropriate choice of n depending on the graph), (b) retrieving the average (or min, or max) of groups of n samples over the period, or (c) retrieving n evenly spaced samples over the period. However, it doesn't look like Apigee supports any of these through its data retrieval APIs.
I would have thought that retrieving time-series data in such a fashion is not an unusual use case, so hopefully someone has already tackled this. Is it possible?
One way to accomplish this is to have a field (say, sample_bin) that is assigned a random value between 0 and n when you save each reading. Then, when you query the data, add the condition that sample_bin equals a specific number between 0 and n. This saves you from retrieving all of the records from the database to do the sampling, and should result in a more or less evenly distributed random sample.
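The binning idea is independent of Apigee's own API, so here is a generic Python sketch; the store.insert and store.query calls are hypothetical placeholders for whatever persistence API you use, and only the sample_bin logic is the point:

import random

N_BINS = 10  # querying one bin returns roughly 1/10th of the readings

def save_reading(store, temperature: float, ts: str):
    # Attach a random bin to each reading at write time.
    store.insert({
        "temperature": temperature,
        "ts": ts,
        "sample_bin": random.randint(0, N_BINS - 1),
    })

def sampled_readings(store, start_ts: str, end_ts: str):
    # Query a single bin to get an approximately even 1/N_BINS sample.
    return store.query("ts >= :start AND ts < :end AND sample_bin = 0",
                       start=start_ts, end=end_ts)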
