InfluxDB: optimizing storage for 2.7 billion series and more - bigdata

We're looking to migrate some data into InfluxDB. I'm working with InfluxDB 2.0 on a test server to determine the best way to store our data.
As of today, I have about 2.7 billion series to migrate to InfluxDB, and that number will only go up.
Here is the structure of the data I need to store:
ClientId (332 values as of today, string of 7 characters)
Driver (int, 45k values as of today, will increase)
Vehicle (int, 28k values as of today, will increase)
Channel (100 values, should not increase, string of 40 characters)
value of the channel (float, 1 value per channel/vehicle/driver/client at a given timestamp)
At first, I thought of storing my data this way:
One bucket (as all data have the same data retention)
Measurements = channels (so 100 kinds of measurements are stored)
Tag Keys = ClientId
Fields = Driver, Vehicle, Value of channel
This gave me a cardinality of 1 * 100 * 332 * 3 = 99,600 according to this article.
But then I realized that InfluxDB handles duplicates based on "measurement name, tag set, and timestamp".
So for my data this will not work, as I need duplicates to be identified by ClientId, Channel, and Vehicle at a minimum.
But if I change my data structure to be stored this way:
One bucket (as all data have the same data retention)
Measurements = channels (so 100 kinds of measurements are stored)
Tag Keys = ClientId, Vehicle
Fields = Driver, Value of channel
then I'll get a cardinality of 2,788,800,000.
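For illustration, here is a minimal write sketch of this second layout using the influxdb-client Python library; the URL, token, org, bucket name, and example values are placeholders rather than anything from our real setup:

from datetime import datetime, timezone
from influxdb_client import InfluxDBClient, Point, WritePrecision
from influxdb_client.client.write_api import SYNCHRONOUS

# Connection details are placeholders.
client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")
write_api = client.write_api(write_options=SYNCHRONOUS)

# Measurement = channel, tags = ClientId + Vehicle, fields = Driver + channel value.
# Every distinct (measurement, ClientId, Vehicle) combination becomes its own series,
# which is what drives the multi-billion cardinality figure.
point = (
    Point("engine_rpm")                  # one of the ~100 channels
    .tag("ClientId", "CLT0001")          # 332 values
    .tag("Vehicle", "12345")             # 28k values and growing
    .field("Driver", 67890)              # a field, not a tag, so its 45k values don't multiply cardinality
    .field("value", 1450.0)
    .time(datetime(2021, 6, 15, 10, 0, tzinfo=timezone.utc), WritePrecision.S)
)
write_api.write(bucket="telemetry", record=point)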
I understand that I need to keep cardinality as low as possible. (And ideally I would even need to be able to search by driver as well as by vehicle.)
My questions are:
If I split the data into different buckets (e.g. one bucket per clientId), will that decrease my cardinality?
What would be the best way to store data for such a large number of series?

Related

DynamoDB - Extract date and Query

I have the following table in DynamoDB.
I want to get/extract all the data using the following conditions or filters:
This month's data: the set of records that belong to the 1st of this month through today. (I think I can achieve this using the BEGINS_WITH filter, but I'm not sure whether this is the correct approach.)
This quarter's data: the set of records that belong to this quarter, basically from 1 April 2021 to 30 June 2021.
This year's data: the set of records that belong to this entire year.
Question: How can I filter/query the data using the date column from the above table to get these three types of data (month, quarter, year)?
Other Details
Table Size : 25 GB
Item Count : 4,081,678
It looks like you have time-based access patterns (e.g. fetch by month, quarter, year, etc).
Because your sort key starts with a date, you can implement your access patterns using the between condition on your sort key. For example (in pseudo code):
Fetch User 1 data for this month
query where user_id = 1 and date between 2021-06-01 and 2021-06-30
Fetch User 1 data for this quarter
query where user_id = 1 and date between 2021-04-01 and 2021-06-30
Fetch User 1 data for this year
query where user_id = 1 and date between 2021-01-01 and 2021-12-31
If you need to fetch across all users, you could use the same approach using the scan operation. While scan is commonly considered wasteful/inefficient, it's a fine approach if you run this type of query infrequently.
However, if this is a common access pattern, you might want to consider re-organizing your data to make this operation more efficient.
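To make that concrete, here is a hedged boto3 sketch of those queries; the table name is a placeholder, the key names user_id and date come from the pseudocode above, and pagination is omitted:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("my-table")   # placeholder table name

def fetch_range(user_id, start_date, end_date):
    # Fetch one user's items whose sort key falls between start_date and end_date.
    response = table.query(
        KeyConditionExpression=Key("user_id").eq(user_id)
        & Key("date").between(start_date, end_date)
    )
    return response["Items"]

this_month = fetch_range(1, "2021-06-01", "2021-06-30")
this_quarter = fetch_range(1, "2021-04-01", "2021-06-30")
this_year = fetch_range(1, "2021-01-01", "2021-12-31")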
As mentioned in the answer above by @Seth Geoghegan, this table design is not ideal; you should think carefully about your partition key and sort key before creating the table. Still, for people like me who already have this kind of scenario, here are the steps I followed to mitigate my issue:
Enabled DynamoDB Streams.
Re-triggered the data so that the records pass through the DDB Streams (I added one additional column, updated_dttm, to all of my records using a script).
Processed the stream records; in my case I broke the date column above into three more columns, event_date, category, and sub_category, and wrote them back to the original record using a Lambda (see the sketch after these steps).
Then I was able to query my data using the event_date column; I can also create an index on event_date to make my queries/searches more effective.
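A rough sketch of the Lambda in step 3, assuming the stream delivers NEW_IMAGE records and the original date column can be split into its three parts; the table name, key names, and the "#" separator are assumptions, since the original table isn't shown:

import boto3

table = boto3.resource("dynamodb").Table("my-table")   # placeholder table name

def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] not in ("INSERT", "MODIFY"):
            continue
        new_image = record["dynamodb"]["NewImage"]
        if "event_date" in new_image:
            continue                                    # already processed, avoid re-triggering
        # Assumed format of the original column: "2021-06-15#category#sub_category".
        date_value = new_image["date"]["S"]
        event_date, category, sub_category = date_value.split("#")
        # Write the derived columns back to the original item.
        table.update_item(
            Key={
                "pk": new_image["pk"]["S"],             # placeholder partition key name
                "date": date_value,
            },
            UpdateExpression="SET event_date = :d, category = :c, sub_category = :s",
            ExpressionAttributeValues={":d": event_date, ":c": category, ":s": sub_category},
        )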
Points to Consider
Cost for updating records so that they can go to DDB Streams
Cost of reprocessing the records
Cost for updating records back to DDB

To query Last 7 days data in DynamoDB

I have my DynamoDB table as follows:
HashKey (Date), RangeKey (timestamp)
The DB stores the data of each day (hash key) and timestamp (range key).
Now I want to query the data of the last 7 days.
Can I do this in one query, or do I need to call DynamoDB 7 times, once for each day? The order of the data does not matter, so can someone suggest an efficient query to do that?
I think you have a few options here.
BatchGetItem - The BatchGetItem operation returns the attributes of one or more items from one or more tables. You identify requested items by primary key. You could specify all 7 primary keys and fire off a single request.
7 calls to DynamoDB. Not ideal, but it'd get the job done.
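For what it's worth, option 2 is just a short loop of Query calls; a boto3 sketch, with the table name as a placeholder, the Date hash key taken from the question, and pagination omitted:

from datetime import date, timedelta
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("my-table")    # placeholder table name

items = []
today = date.today()
for offset in range(7):
    day = (today - timedelta(days=offset)).isoformat()  # e.g. "2021-06-15"
    response = table.query(KeyConditionExpression=Key("Date").eq(day))
    items.extend(response["Items"])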
Introduce a global secondary index that projects your data into the shape your application needs. For example, you could introduce an attribute that represents an entire week by using a truncated timestamp:
2021-02-08 (represents the week of 2021-02-08T00:00:00 - 2021-02-14T23:59:59)
2021-02-15 (represents the week of 2021-02-15T00:00:00 - 2021-02-21T23:59:59)
I call this a "truncated timestamp" because I am effectively ignoring the HH:MM:SS portion of the timestamp. When you create a new item in DDB, you could introduce a truncated timestamp that represents the week it was inserted. Therefore, all items inserted in the same week will show up in the same item collection in your GSI.
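Here is a small sketch of how such a truncated timestamp could be derived when writing an item; the attribute name week and the choice of Monday as the week boundary are just for illustration:

from datetime import datetime, timedelta

def week_of(ts):
    # Truncate a timestamp to the Monday of its week, e.g. 2021-02-10 -> "2021-02-08".
    monday = ts.date() - timedelta(days=ts.weekday())
    return monday.isoformat()

# All items written during the same week share this value, so they land in the same
# item collection of a GSI whose partition key is the "week" attribute.
item = {
    "Date": "2021-02-10",
    "timestamp": "2021-02-10T14:23:05",
    "week": week_of(datetime(2021, 2, 10, 14, 23, 5)),
}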
Depending on the volume of data you're dealing with, you might also consider separate tables to segregate ranges of data. AWS has an article describing this pattern.

Table design for storing immutable time series telemetry/sensor data?

I'm looking for some advice on a DynamoDB table design to store telemetry data streaming in from thousands of sensor hubs. The sensor hubs each send up to 15,000 messages per day, containing the following:
timestamp (unix time)
station_id (uuid)
sensor_type (string)
sensor_data (json)
I've looked into best practices for storing time series data, and will adopt a table partitioning strategy where a new "hot data" table is created each month (adjusting RCUs and WCUs accordingly for older, "cooler" tables).
What I'm not sure about is picking a suitable hash key and sort key, as well as setting up indexes, etc.
The majority of the queries to data will be: Give me messages where station_id = "foo" and sensor_type = "bar", and timestamp is between x and y.
At a guess, I'm assuming I would use station_id as the hash key and timestamp as the sort key, but how do I query for messages with a particular sensor_type without resorting to filters? Would I be best off combining station_id and sensor_type as the hash key?
Judging from the query example you've provided, I would create the following table:
stationId_sensorType (String, partition key) - a combined attribute that contains the concatenated station id and sensor type
timestamp (Number, range key) - a UNIX timestamp that you can use to sort by timestamp or to find only records with timestamps in a given range.
This will allow you to get all values for a pair of (stationId, sensorType).
You can also store stationId and sensorType as separate fields in your items, and then create a GSI on them to support other queries, like getting all values for a stationId.
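A hedged boto3 sketch of this layout; the table name, the "#" separator in the combined key, and the example values are assumptions:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("telemetry")    # placeholder table name

# Write: composite partition key "<station_id>#<sensor_type>", numeric sort key.
table.put_item(Item={
    "stationId_sensorType": "foo#bar",
    "timestamp": 1623751200,                             # unix time
    "station_id": "foo",                                 # kept separately so a GSI
    "sensor_type": "bar",                                #   can serve per-station queries
    "sensor_data": {"reading": 21},
})

# Query: messages where station_id = "foo", sensor_type = "bar", timestamp between x and y.
response = table.query(
    KeyConditionExpression=Key("stationId_sensorType").eq("foo#bar")
    & Key("timestamp").between(1623744000, 1623751200)
)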

Applying calculations on large data set

I'm currently optimizing our data warehouse and the processes that use it, and I'm looking for some suggestions.
The problem is that I'm not sure about the calculations on the retrieved data.
To make things clearer, for example we have the following data structure:
id : 1
param: static_value
param2: static_value
And let's consider that we have about 50 million entries with this structure.
Also, let's assume that we query this data set about 30 times per minute, and each query returns at least 10k entries.
So, in short, we have these stats:
Data set: 50 million entries.
Access frequency: ~30 queries per minute.
Result size per query: ~10k entries.
On every query, I have to go through every entry in the result set and apply some calculations to it, which produces a field (for example param3) with a dynamic value. For example:
Query 2 (2k results) and one of its entries:
id : 2
param: static_value_2
param2: static_value_2
param3: dynamic_value_2
Query 3 (10k results) and one of its entries:
id : 3
param: static_value_3
param2: static_value_3
param3: dynamic_value_3
And so on..
The problem is that I can't prepare the param3 value before the query returns its results, because the calculations use many dynamic values.
Main question:
Are there any guidelines, practices, or even technologies for optimizing this kind of problem or implementing this kind of solution?
Thanks for any information.
Update 1:
The field "param3" is calculated on every query in every data result entry, it means that this calculated value is not stored in any storage it just computed on every query. I can't store this value because it's dynamic and depends on many variables due this reason i can't store it as static value when it's dynamic.
I guess it's not good practise to have such implementation?
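For concreteness, a minimal sketch of the per-query post-processing described above; compute_param3 and its context argument are hypothetical stand-ins for the real calculation:

def compute_param3(row, context):
    # Hypothetical stand-in for the real calculation, which depends on
    # dynamic inputs that are only known at query time.
    return context["factor"] * len(row["param"])

def enrich(rows, context):
    # Applied to every result set: ~10k rows per query, ~30 queries per minute.
    for row in rows:
        row["param3"] = compute_param3(row, context)
    return rows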

Sqlite database organization

So I have a collection of recorded rat vocalizations (that's right, the animal). There are several rats I have recordings from. In addition, each recording is sampled at 300e3 Hz and is several ms in duration, so each recording consists of tens of thousands of floats stored in a column called values. I've stored all this in an SQLite db, but I have a question about the optimal way to organize it. Right now I have a composite primary key (rat_id char(2), recording_# INT, time_step INT). The recording_# starts at 0 for each new rat_id, and the time_step starts at 0 for each new recording_#. The problem comes in when I try to delete a recording and decrement all values of recording_index greater than the one I deleted. This of course takes forever, because it has to do millions of operations each time.
cur.execute("DELETE FROM recordings WHERE ratid=? AND recording_index=?", (self.rat, recording_index))
cur.execute("UPDATE recordings SET recording_index = -(recording_index-1) WHERE ratid=? AND recording_index>?", (self.rat, recording_index))
cur.execute("UPDATE recordings SET recording_index = -recording_index WHERE ratid=? AND recording_index<0", (self.rat,))
So is there a better way to organize the table so that the decrement operation doesn't take as long, or should I just not bother with the decrement operation? I don't really need to do it; I'm just being kind of overly concerned with details.
