How to split a table horizontally across multiple zones? - distributed-database

I want to split data across multiple TiKV nodes because I have Swiss, European and American users, and I need to store each citizen's data in their own country.
The users table has a country code, and data should automatically be stored in the corresponding zone (tikv --label zone=ch/eu/us).
How can I do this?
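One possible way to get this behaviour, if you use TiDB as the SQL layer on top of TiKV, is Placement Rules in SQL combined with LIST partitioning. The sketch below is an assumption, not a confirmed answer: it assumes TiDB >= 6.0, that the TiKV nodes carry the zone=ch/eu/us labels mentioned above, and that the connection details, policy/table names and country-to-zone mapping are all illustrative.

```python
# Minimal sketch (assumption): pin each LIST partition of the users table to a
# zone-labelled group of TiKV stores via TiDB Placement Rules in SQL.
# Connection parameters, names and country lists are placeholders.
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=4000, user="root",
                       password="", database="app")

statements = [
    # One placement policy per zone; the labels must match the labels the
    # TiKV processes were started with (tikv-server --labels zone=ch|eu|us).
    "CREATE PLACEMENT POLICY store_ch CONSTRAINTS='[+zone=ch]'",
    "CREATE PLACEMENT POLICY store_eu CONSTRAINTS='[+zone=eu]'",
    "CREATE PLACEMENT POLICY store_us CONSTRAINTS='[+zone=us]'",
    # LIST partitioning on the country code; each partition gets its own policy.
    # Note: in a partitioned table the primary key must include the partition column.
    """
    CREATE TABLE users (
        id BIGINT NOT NULL,
        country_code CHAR(2) NOT NULL,
        name VARCHAR(255),
        PRIMARY KEY (id, country_code)
    )
    PARTITION BY LIST COLUMNS (country_code) (
        PARTITION p_ch VALUES IN ('CH') PLACEMENT POLICY=store_ch,
        PARTITION p_eu VALUES IN ('FR', 'DE', 'IT', 'ES') PLACEMENT POLICY=store_eu,
        PARTITION p_us VALUES IN ('US') PLACEMENT POLICY=store_us
    )
    """,
]

with conn.cursor() as cur:
    for stmt in statements:
        cur.execute(stmt)
conn.commit()
```

With a layout like this, PD schedules the replicas of each partition onto the stores whose labels satisfy the partition's policy, so rows insert into the "right" country's zone automatically based on country_code.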

Related

Query DynamoDB based on whether attribute is member of a group

This is a simplified version of my problem using a DynamoDB Table. Most items in the Table represent sales across multiple countries. One of my required access patterns is to retrieve all sales in countries which belong to a certain country_grouping between a range of order_dates. The incoming stream of sales data contains the country attribute, but not the country_grouping attribute.
Another entity in the same Table is an infrequently updated reference table that maps each country to its country_grouping. Can I design a GSI, or otherwise structure the table, to retrieve all sales for a given country_grouping between a range of order dates?
Here's an example of the Table structure:
| PK | SK | sale_id | order_date | country | country_grouping |
|---|---|---|---|---|---|
| SALE#ID#1 | ORDER_DATE#2022-06-01 | 1 | 2022-06-01 | UK | |
| SALE#ID#2 | ORDER_DATE#2022-09-01 | 2 | 2022-09-01 | France | |
| SALE#ID#3 | ORDER_DATE#2022-07-01 | 3 | 2022-07-01 | Switzerland | |
| COUNTRY_GROUPING#EU | COUNTRY#France | | | France | EU |
| COUNTRY_GROUPING#NATO | COUNTRY#UK | | | UK | NATO |
| COUNTRY_GROUPING#NATO | COUNTRY#France | | | France | NATO |
Possible solution 1
As the sales items are streamed into the Table, query the country_grouping associated with the country in the sale, and write the corresponding country_grouping to each sale record. Then create a GSI where country_grouping is the partition key and order_date is the sort key. This seems expensive to me, consuming 1 RCU and 1 WCU per sale record imported. If country groupings changed (imagine the UK rejoins the EU), I would also have to run an update operation against every existing sale in the UK.
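A minimal sketch of this enrich-at-ingest idea, assuming boto3; the table name, GSI name and the lookup_grouping helper are hypothetical, and a country belonging to several groupings would need one denormalised copy per grouping:

```python
# Sketch of "possible solution 1" (assumption: boto3; names are illustrative).
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("SalesTable")  # hypothetical name

def lookup_grouping(country):
    """Hypothetical helper: resolves a country to its grouping from the
    reference items (e.g. via an inverted index keyed on COUNTRY#<name>)."""
    ...

def put_sale(sale):
    grouping = lookup_grouping(sale["country"])        # ~1 read unit per sale
    table.put_item(Item={                              # 1+ WCU per sale
        "PK": f"SALE#ID#{sale['sale_id']}",
        "SK": f"ORDER_DATE#{sale['order_date']}",
        "sale_id": sale["sale_id"],
        "order_date": sale["order_date"],
        "country": sale["country"],
        "country_grouping": grouping,                  # denormalised copy
    })

# Hypothetical GSI "grouping-date-index": PK=country_grouping, SK=order_date.
def sales_for_grouping(grouping, start, end):
    return table.query(
        IndexName="grouping-date-index",
        KeyConditionExpression=Key("country_grouping").eq(grouping)
        & Key("order_date").between(start, end),
    )["Items"]
```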
Possible solution 2
Have the application first query to retrieve every country in the desired country_grouping, then send an individual request for each country using a GSI where the partition key is country and the order_date is the sort key. Again, this seems less than ideal, as I consume 1 WCU per country, plus the 1 WCU to obtain the list of countries.
Is there a better way?
Picking an optimal solution depends on factors you haven't mentioned:
- How quickly you need it to execute
- How many sales records per country and country group you insert
- How many sales records per country you expect there to be in the db at query time
- How large a Sale item is
For example, if your Sale items are large and/or you insert a lot every second, you're going to need to worry about creating a hot key in the GSI. I'm going to assume your update rate is not too high, the Sale item size isn't too large, and you're going to have thousands or more Sale items per country.
If my assumptions are correct, then I'd go with Solution 2. You'll spend one read unit (it's not a WCU but rather an RCU, and it's only half a read unit if eventually consistent) to Query the country group and get a list of countries. Then do one Query for each country in that group to pull all the Sale items matching the specific time range for that country. Since there are lots of matching sales, the cost is about the same: one 400 KB pull from a country_grouping PK costs the same as four 100 KB pulls from four different country PKs. You can also run the country Query calls in parallel to speed up execution, which helps if you're returning megabytes of data or more.
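A sketch of that fan-out, assuming boto3; the table name and the GSI name "country-date-index" (country as partition key, order_date as sort key) are hypothetical, and pagination via LastEvaluatedKey is omitted:

```python
# Sketch of "possible solution 2" with parallel per-country Queries.
from concurrent.futures import ThreadPoolExecutor
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("SalesTable")  # hypothetical name

def countries_in_grouping(grouping):
    # One cheap Query against the reference partition, e.g. COUNTRY_GROUPING#EU.
    resp = table.query(
        KeyConditionExpression=Key("PK").eq(f"COUNTRY_GROUPING#{grouping}")
    )
    return [item["country"] for item in resp["Items"]]

def sales_for_country(country, start, end):
    resp = table.query(
        IndexName="country-date-index",  # hypothetical GSI: PK=country, SK=order_date
        KeyConditionExpression=Key("country").eq(country)
        & Key("order_date").between(start, end),
    )
    return resp["Items"]

def sales_for_grouping(grouping, start, end):
    countries = countries_in_grouping(grouping)
    with ThreadPoolExecutor() as pool:
        per_country = pool.map(lambda c: sales_for_country(c, start, end), countries)
    return [sale for items in per_country for sale in items]
```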
If in fact you have only a few sales per country, well, then any design will work.
Your solution 1 is probably best. The underlying issue is that the PK actually defines the physical location on a server (both for the original entry and its GSI copy). Storage is cheap, so you duplicate data to get better query performance.
So if, as you said, the UK rejoins the EU, you won't be modifying the GSI entries; AWS will create new entries in a different location since the PK changed.
How about if you put the country_grouping in the SK of the sale?
For example COUNTRY_GROUPING#EU#ORDER_DATE#2022-07-01
Then you can do a "begins with" query and avoid the GSI, which would otherwise consume extra capacity units.
The country group lookup can be cached in memory to save some units, and I wouldn't design my table around one-time events like the UK leaving. If that happens, do a full scan and update everything; it's a one-time operation, not a big deal.
Also, Dynamo is not designed to store items for long periods of time. Typically you would store the sales for the past 30 days (for example), set a TTL on the items, and stream them to S3 (or BigQuery) once they expire.
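For reference, a sketch of the TTL part, assuming boto3; the table name and the attribute name "expires_at" are illustrative:

```python
# Sketch: enable TTL once on the table, then stamp each item with an
# epoch-seconds expiry ~30 days out. DynamoDB deletes expired items at no
# write cost; a Stream/Firehose pipeline can archive them to S3 on deletion.
import time
import boto3

client = boto3.client("dynamodb")
client.update_time_to_live(
    TableName="SalesTable",  # hypothetical table name
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
)

# On every put_item, include e.g.:
expires_at = int(time.time()) + 30 * 24 * 3600  # Number attribute on the item
```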

DynamoDB table size

I'm developing a DynamoDB table housing salary data by zip code. Here's the structure of my table:
For a given zip code, there will be a sort key called Meta which houses lat/lon, city, state, county, etc. In addition to Meta, I will have sort key values for the different sources of salary data for the given zip code. Each salary-data value will be a JSON document. I will have a lot of different data sources. For example, there are around 41K zip codes and around 1,100 ONET codes, which equates to around 46 million rows, give or take - just for the ONET data source type.
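To make the layout concrete, a small sketch of what two items under one zip code might look like; the key names PK/SK, the "ZIP#"/"SOURCE#ONET#" prefixes and the figures are illustrative assumptions, not from the original post:

```python
# Hypothetical item shapes for the zip-code table described above.
meta_item = {
    "PK": "ZIP#94103",
    "SK": "META",
    "lat": 37.77, "lon": -122.41,
    "city": "San Francisco", "state": "CA", "county": "San Francisco",
}

salary_item = {
    "PK": "ZIP#94103",
    "SK": "SOURCE#ONET#15-1252.00",   # one item per (zip, source, code)
    "payload": {"median": 142000, "p25": 118000, "p75": 171000},  # JSON document
}
```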
How many rows can a DynamoDB table efficiently handle?
Is there a better approach to structuring this data?
Thank you for your time.

Storing Time-Series Data of different resolution in DynamoDB

I am wondering if anyone knows a good way to store time series data of different time resolutions in DynamoDB.
For example, I have devices that send data to DynamoDB every 30 seconds. The individual readings are stored in a Table with the unique device ID as the Hash Key and a timestamp as the Range Key.
I want to aggregate this data over various time steps (30 mins, 1 hr, 1 day, etc.) using a Lambda and store the aggregates in DynamoDB as well. I then want to be able to grab data at any resolution for any particular range of time: for instance, the forty-eight 30-minute aggregates for the last 24 hours, or each daily aggregate for this month last year.
I am unsure whether each new resolution should have its own table (data_30min, data_1hr, etc.) or whether a better approach would be something like making a composite Hash Key by combining the resolution with the device ID and storing all aggregate data in a single table.
For instance, if the device ID is abc123, all 30-minute data could be stored with the Hash Key abc123_30m and the 1-hour data could be stored with the Hash Key abc123_1h, and each would still use a timestamp as the range key.
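A minimal sketch of that single-table, composite-hash-key option, assuming boto3; the table name and attribute names are illustrative:

```python
# Sketch: one table for all resolutions; hash key "<device_id>_<resolution>",
# range key is an ISO-8601 timestamp, so a time range is a single Query.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("DeviceData")  # hypothetical name

def put_aggregate(device_id, resolution, ts_iso, payload):
    table.put_item(Item={
        "device_res": f"{device_id}_{resolution}",  # e.g. "abc123_30m"
        "ts": ts_iso,
        **payload,
    })

def get_range(device_id, resolution, start_iso, end_iso):
    # e.g. get_range("abc123", "30m", "2024-01-01T00:00:00", "2024-01-02T00:00:00")
    return table.query(
        KeyConditionExpression=Key("device_res").eq(f"{device_id}_{resolution}")
        & Key("ts").between(start_iso, end_iso),
    )["Items"]
```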
What are some pros and cons to each of these approaches and is there a solution I am not thinking of which would be useful in this situation?
Thanks in advance.
I'm not sure if you've seen this page from the tech docs regarding Best Practices for storing time series data in DynamoDB. It talks about splitting your data into time periods such that you only have one "hot" table where you're writing and many "cold" tables that you only read from.
Regarding the partition/sort key selection, you should probably use a coarse timestamp value as the partition key and the actual timestamp as the sort key. Alternatively, if your periods are coarse enough, or each device only produces a relatively small amount of data, then your idea of using the device ID as the hash key could work as well.
Generating pre-aggregates and storing them in DynamoDB would certainly work, though you should definitely consider having separate tables for the different granularities you want to support. Beware of mutating data: as long as all your data arrives in order and you don't need to recompute old data, storing pre-aggregated time series is fine, but if data can mutate, or if you have to account for out-of-order/late-arriving data, then things get complicated.
You may also consider a relational database for the "hot" data (i.e. the last 7 days, or whatever period makes sense) and then running a batch process to pre-aggregate and move the data into cold, read-only DynamoDB tables, with DAX etc.

Need Inputs for Database Structure in firebase

1. My query is regarding the Firebase database structure for the requirement below, and how to normalize the data. I will have 0.5M member records, with classified functions under team categories, identified across multiple states -> regions -> zones.
2. Next, when a new member fills in the form, the region field should auto-populate based on the state chosen, and likewise for zones. As a newbie to Firebase, I have planned to create a key_id for each state, and key_ids for regions via the push method as well. Will this help in addressing most of the operational queries on members with functions across states / regions / zones?
3. Will normalizing the data by districts / members help? I plan to duplicate the state, region and team category attributes in the Members data to get an efficient structure and reduced query time over a vast number of records.
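One way this could be laid out, sketched with the firebase-admin Realtime Database SDK; all paths, IDs and field names here are illustrative assumptions, not part of the original question:

```python
# Sketch: a small lookup tree for the state -> region -> zone drop-downs, plus
# a denormalised member record that duplicates those attributes for cheap reads.
import firebase_admin
from firebase_admin import credentials, db

firebase_admin.initialize_app(
    credentials.Certificate("service-account.json"),
    {"databaseURL": "https://example-project.firebaseio.com"},  # placeholder URL
)

# Lookup nodes used to auto-populate the region/zone fields for a chosen state.
db.reference("geo/states/KA").set({
    "name": "Karnataka",
    "regions": {
        "-Nreg1": {"name": "Bengaluru Region",
                   "zones": {"-Nzn1": {"name": "South Zone"}}},
    },
})

# Denormalised member record: state/region/zone/category are duplicated here.
db.reference("members").push({
    "name": "A. Member",
    "state": "KA",
    "region": "-Nreg1",
    "zone": "-Nzn1",
    "team_category": "logistics",
})
```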

Structuring Data In Firebase

I'm contemplating using Firebase for an upcoming project, but before doing so I want to make sure I can develop a data structure that will meet my purposes. I'm interested in tracking horse race results for approximately 25 racetracks across the US. My initial impression was that my use case aligned nicely with the Firebase Weather Data Set. The Weather data set is organized by city and then in various time series: currently, hourly, daily and minutely.
My initial thought was that I could follow a similar approach and use the 25 tracks as cities and then organize by years, months, days and races.
This structure lends itself nicely to accessing data from a particular track, but suppose that I also want to access data across all tracks. For example, accessing data for all tracks for races that occurred in 2014 and had more than 10 horses.
Questions:
Does my proposed data structure limit me to queries by track only, or would I still be able to query across tracks, years, days, months, etc. and incorporate any and all of the various metadata attributes: number of horses, distance of race, etc.?
Given my interest in freeform querying, is there another data structure that would be more advantageous?
Is Firebase similar to MongoDB in having issues with collections (lists) that grow, or can one continue to push to a list without preallocating or worrying about sharding?
I believe my confusion stems from the url/path nature of the data storage.
EDIT:
Here is a sample of what I had in mind:
Thanks for your input.
I would think that you would want to organize by horse first. I guess it depends what you are deriving from the data.
One horse could be at different tracks.
Horses table
* Horsename
-----Date
-----Track
-----Racenumber
-----Gate
-----Jockey
-----Place
-----Odds
-----Mud?
Races table
----Track
----Racenumber
----Date
----Time
----NumberOfHorses
Link the tables and you could get at any one part of it.
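Since the Realtime Database has no joins, "linking the tables" in practice usually means fanning the same result out under both entities. A sketch of that, assuming firebase-admin is already initialized; the paths and values are illustrative:

```python
# Sketch: write the same race result under both the horse and the race, so
# "everything this horse ran" and "everything in this race" are each one read.
from firebase_admin import db

result = {
    "date": "2014-06-07", "track": "track_x", "race_number": 5,
    "gate": 2, "jockey": "J. Doe", "place": 4, "odds": "5-2", "mud": False,
}

db.reference().update({
    "horses/horse_abc/results/2014-06-07_track_x_5": result,
    "races/track_x/2014/06/07/5/horses/horse_abc": result,
})
```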
