I'm developing a DynamoDB table housing salary data by zip code. Here's the structure of my table:
For a given zip code, there will be a Sort Key value called Meta which houses lat/lon, city, state, county, etc. In addition to Meta, the Sort Key will take a value for each source of salary data for the given zip code, and each of those salary items will be a JSON document. I will have a lot of different data sources. For example, there are around 41K zip codes and around 1,100 ONET codes, which equates to around 46 million rows, give or take, just for the ONET data source type.
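For concreteness, here is a minimal sketch of that layout using boto3, assuming the zip code is the partition key and a single sort key attribute distinguishes Meta from the per-source items (the table name, attribute names, and ONET values are illustrative):

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("SalaryByZip")  # assumed table name

# One Meta item per zip code holding the location attributes.
table.put_item(Item={
    "zip": "30301",            # partition key: zip code
    "sk": "META",              # sort key value for the metadata item
    "lat": "33.7490",
    "lon": "-84.3880",
    "city": "Atlanta",
    "state": "GA",
    "county": "Fulton",
})

# One item per (zip code, data source), each holding a JSON salary document.
table.put_item(Item={
    "zip": "30301",
    "sk": "ONET#11-1011.00",   # sort key naming the data source, e.g. an ONET code
    "salary_data": {"median": 104000, "p25": 78000, "p75": 131000},
})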
How many rows can a DynamoDB table efficiently handle?
Is there a better approach to structuring this data?
Thank you for your time.
This is a simplified version of my problem using a DynamoDB Table. Most items in the Table represent sales across multiple countries. One of my required access patterns is to retrieve all sales in countries which belong to a certain country_grouping between a range of order_dates. The incoming stream of sales data contains the country attribute, but not the country_grouping attribute.
Another entity in the same Table is a reference table, which is infrequently updated, which could be used to identify the country_grouping for each country. Can I design a GSI or otherwise structure the table to retrieve all sales for a given country_grouping between a range of order dates?
Here's an example of the Table structure:
| PK                    | SK                    | sale_id | order_date | country     | country_grouping |
|-----------------------|-----------------------|---------|------------|-------------|------------------|
| SALE#ID#1             | ORDER_DATE#2022-06-01 | 1       | 2022-06-01 | UK          |                  |
| SALE#ID#2             | ORDER_DATE#2022-09-01 | 2       | 2022-09-01 | France      |                  |
| SALE#ID#3             | ORDER_DATE#2022-07-01 | 3       | 2022-07-01 | Switzerland |                  |
| COUNTRY_GROUPING#EU   | COUNTRY#France        |         |            | France      | EU               |
| COUNTRY_GROUPING#NATO | COUNTRY#UK            |         |            | UK          | NATO             |
| COUNTRY_GROUPING#NATO | COUNTRY#France        |         |            | France      | NATO             |
Possible solution 1
As the sales items are streamed into the Table, query the country_grouping associated with the country in the sale, and write the corresponding country_grouping to each sale record. Then create a GSI where country_grouping is the partition key and the order_date is the sort key. This seems expensive to me, consuming 1 RCU and 1 WCU per sale record imported. If country groupings changed (imagine the UK rejoins the EU), then I would run an update operation against all sales in the UK.
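A rough sketch of what solution 1 could look like in boto3. The table, attribute, and GSI names are assumptions, and note that a country such as France can belong to several groupings, so you may need one enriched attribute value (or copy) per grouping:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Sales")  # assumed table name

# Sale item enriched with its grouping at ingest time; the GSI would use
# country_grouping as its partition key and order_date as its sort key.
table.put_item(Item={
    "PK": "SALE#ID#1",
    "SK": "ORDER_DATE#2022-06-01",
    "sale_id": 1,
    "order_date": "2022-06-01",
    "country": "UK",
    "country_grouping": "NATO",
})

# Single query against the assumed GSI for a grouping and date range.
resp = table.query(
    IndexName="country_grouping-order_date-index",   # assumed GSI name
    KeyConditionExpression=Key("country_grouping").eq("NATO")
    & Key("order_date").between("2022-06-01", "2022-09-30"),
)
sales = resp["Items"]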
Possible solution 2
Have the application first query to retrieve every country in the desired country_grouping, then send an individual request for each country using a GSI where the partition key is country and the order_date is the sort key. Again, this seems less than ideal, as I consume 1 WCU per country, plus the 1 WCU to obtain the list of countries.
Is there a better way?
Picking an optimal solution depends on factors you haven't mentioned:
How quickly you need it to execute
How many sales records per country and country group you insert
How many sales records per country you expect there to be in the db at query time
How large a Sale item is
For example, if your Sale items are large and/or you insert a lot every second, you're going to need to worry about creating a hot key in the GSI. I'm going to assume your update rate is not too high, the Sale item size isn't too large, and you're going to have thousands or more Sale items per country.
If my assumptions are correct, then I'd go with Solution 2. You'll spend one read unit (it's not a WCU but rather an RCU, and it's only half a read unit if eventually consistent) to Query the country group and get a list of countries. Then do one Query for each country in that group to pull all the Sale items matching the specific time range for that country. Since there are lots of matching sales, the cost is about the same either way: one 400 KB pull from a country_grouping PK costs the same as four 100 KB pulls from four different country PKs. You can also do the country Query calls in parallel, if you want, to speed execution; if you're returning megabytes of data or more, this will be helpful.
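A sketch of that two-step read path in boto3, under the same assumptions (table, attribute, and GSI names are illustrative; pagination is omitted for brevity):

import boto3
from concurrent.futures import ThreadPoolExecutor
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Sales")  # assumed table name

def countries_in_grouping(grouping):
    # Read the reference items for one grouping to get its member countries.
    resp = table.query(
        KeyConditionExpression=Key("PK").eq(f"COUNTRY_GROUPING#{grouping}")
    )
    return [item["country"] for item in resp["Items"]]

def sales_for_country(country, start, end):
    # Assumed GSI: country as partition key, order_date as sort key.
    resp = table.query(
        IndexName="country-order_date-index",
        KeyConditionExpression=Key("country").eq(country)
        & Key("order_date").between(start, end),
    )
    return resp["Items"]

# Fan the per-country queries out in parallel, as suggested above.
countries = countries_in_grouping("NATO")
with ThreadPoolExecutor() as pool:
    results = pool.map(
        lambda c: sales_for_country(c, "2022-06-01", "2022-09-30"), countries
    )
sales = [item for batch in results for item in batch]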
If in fact you have only a few sales per country, well, then any design will work.
Your solution 1 is probably best. The underlying point is that the PK actually determines the physical location on a server (both for the original entry and for its GSI copy). You duplicate data, because storage is cheap, in order to get better query performance.
So if, as you said, the UK rejoins the EU, you won't be modifying the GSI entries in place; AWS will create new entries in a different location because the PK changed.
How about if you put the country_grouping in the SK of the sale?
For example COUNTRY_GROUPING#EU#ORDER_DATE#2022-07-01
Then you can do a "begins with" query and avoid the GSI, which would otherwise consume extra capacity units.
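A sketch of what that key condition could look like in boto3. Note this answer doesn't say which partition key such sales would share; the example assumes they are grouped under a common partition key (here, one per month), which is an assumption of mine:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Sales")  # assumed table name

# All EU sales in July 2022, assuming the SK embeds grouping and order date.
resp = table.query(
    KeyConditionExpression=Key("PK").eq("SALES#2022-07")  # assumed partition key
    & Key("SK").begins_with("COUNTRY_GROUPING#EU#ORDER_DATE#2022-07"),
)
eu_sales_in_july = resp["Items"]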
The country group lookup can be cached in memory to save some units, and I wouldn't design my table around one-time events like the UK leaving; if that happens, do a full scan and update everything. It's a one-time operation, not a big deal.
Also, DynamoDB is not designed to store items for long periods of time. Typically you would keep the sales for the past 30 days (for example), set a TTL on the items, and stream them to S3 (or BigQuery) once they expire.
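For illustration, enabling TTL and stamping each item with an expiry time might look like this in boto3 (table and attribute names are assumptions); expired items then show up as deletes on the table's stream, from which they can be archived:

import time
import boto3

client = boto3.client("dynamodb")

# One-time setup: tell DynamoDB which attribute holds the expiry timestamp.
client.update_time_to_live(
    TableName="Sales",  # assumed table name
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
)

table = boto3.resource("dynamodb").Table("Sales")
table.put_item(Item={
    "PK": "SALE#ID#1",
    "SK": "ORDER_DATE#2022-06-01",
    "order_date": "2022-06-01",
    "country": "UK",
    "expires_at": int(time.time()) + 30 * 24 * 3600,  # epoch seconds, ~30 days out
})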
Quick question on modeling data for a customer …
Customer stores Store data, about 250 records, maybe 10 properties each.
Customer stores Department data, about 1,000 records, again, maybe 10 properties each.
Customer stores Product data, about 2,000,000 records, maybe 20 properties each.
My thought for modeling this data, based on how it is accessed, is to store Store data and Department data in a lookups collection, partitioned on the object property (in this case, Store or Department).
Store the Product data in a products collection, partitioned on the upc_code property.
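For illustration, documents in those two collections might look roughly like this (all property names beyond Store, Department, and upc_code are placeholders of mine):

# "lookups" container, partitioned on the object/type property:
store_doc = {
    "id": "store-042",
    "object": "Store",          # partition key value
    "name": "Downtown",
    "region": "East",
}
department_doc = {
    "id": "dept-117",
    "object": "Department",     # partition key value
    "name": "Bakery",
    "store_id": "store-042",
}

# "products" container, partitioned on upc_code:
product_doc = {
    "id": "0001234567890",
    "upc_code": "0001234567890",  # partition key value
    "name": "Whole wheat bread",
    "price": 3.49,
}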
Does this make sense? Or is there a better way? Specifically, for handling small (< 1,000 records) datasets, should I recommend Table Storage for any of this?
Thanks in advance!
Yes, that could work. I wouldn't use the Table API for this, though. If you want key/value features, use the SQL API and turn off indexing, but only if you look up products by upc_code.
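A hedged sketch of that advice using the azure-cosmos Python SDK: create the products container partitioned on /upc_code with all paths excluded from the index, then fetch products by point read. The endpoint, key, names, and the choice of using the upc_code as the document id are all assumptions:

from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
db = client.get_database_client("retail")  # assumed database name

# Exclude every path from the index so writes don't pay indexing request units.
products = db.create_container_if_not_exists(
    id="products",
    partition_key=PartitionKey(path="/upc_code"),
    indexing_policy={
        "indexingMode": "consistent",
        "automatic": True,
        "includedPaths": [],
        "excludedPaths": [{"path": "/*"}],
    },
)

# Point read by id + partition key (here the id is assumed to equal the upc_code).
item = products.read_item(item="0001234567890", partition_key="0001234567890")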
Another question. Is this data all related? Have you looked at possibly storing this as a graph and using the Gremlin API in Cosmos?
I am wondering if anyone knows a good way to store time series data of different time resolutions in DynamoDB.
For example, I have devices that send data to DynamoDB every 30 seconds. The individual readings are stored in a Table with the unique device ID as the Hash Key and a timestamp as the Range Key.
I want to aggregate this data over various time steps (30 mins, 1 hr, 1 day, etc.) using a Lambda and store the aggregates in DynamoDB as well. I then want to be able to grab data at any resolution for any particular range of time: 48 30-minute aggregates for the last 24 hours, for instance, or each daily aggregate for this month last year.
I am unsure whether each new resolution should have its own table (data_30min, data_1hr, etc.) or whether a better approach would be something like making a composite Hash Key by combining the resolution with the device ID and storing all aggregate data in a single table.
For instance if the device ID is abc123 all 30 minute data could be stored with the Hash Key abc123_30m and the 1hr data could be stored with the HK abc123_1h and each would still use a timestamp as the range key.
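A sketch of the single-table option in boto3, with the resolution baked into the Hash Key as described above (table and attribute names are illustrative):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("device_data")  # assumed table name

# Write a 30-minute aggregate for device abc123.
table.put_item(Item={
    "device_res": "abc123_30m",        # hash key: device ID + resolution
    "ts": "2023-05-01T10:30:00Z",      # range key: aggregate window start
    "mean": 42, "min": 17, "max": 63,
})

# Read the 48 half-hour aggregates for a 24-hour window.
resp = table.query(
    KeyConditionExpression=Key("device_res").eq("abc123_30m")
    & Key("ts").between("2023-04-30T10:30:00Z", "2023-05-01T10:30:00Z"),
)
aggregates = resp["Items"]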
What are some pros and cons to each of these approaches and is there a solution I am not thinking of which would be useful in this situation?
Thanks in advance.
I'm not sure if you've seen this page from the tech docs regarding Best Practices for storing time series data in DynamoDB. It talks about splitting your data into time periods such that you only have one "hot" table where you're writing and many "cold" tables that you only read from.
Regarding the hash/sort key selection, you should probably use a coarse timestamp value as the hash key and the actual timestamp as the sort key. Alternatively, if your periods are coarse enough, or each device only produces a relatively small amount of data, then your idea of using the device ID as the hash key could work as well.
Generating pre-aggregates and storing them in DynamoDB would certainly work, though you should definitely consider having separate tables for the different granularities you want to support. Beware of mutating data: as long as all your data arrives in order and you don't need to recompute old data, storing pre-aggregated time series is fine, but if data can mutate, or if you have to account for out-of-order/late-arriving data, then things get complicated.
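For illustration, a minimal version of that pre-aggregation step (for example, inside a scheduled Lambda) could look like this; it assumes in-order data with no late arrivals, and the table and attribute names are mine:

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
raw = dynamodb.Table("device_readings")       # assumed raw-readings table
agg = dynamodb.Table("device_data_30min")     # assumed per-granularity table

window_start, window_end = "2023-05-01T10:00:00Z", "2023-05-01T10:30:00Z"

# Pull the raw 30-second readings for one device and one window.
resp = raw.query(
    KeyConditionExpression=Key("device_id").eq("abc123")
    & Key("ts").between(window_start, window_end),
)
values = [r["value"] for r in resp["Items"]]

if values:
    # Write a single aggregate row keyed by the window start.
    agg.put_item(Item={
        "device_id": "abc123",
        "ts": window_start,
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "sum": sum(values),   # store the sum so means can be re-derived later
    })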
You may also consider a relational database for the "hot" data (i.e. the last 7 days, or whatever period makes sense) and then run a batch process to pre-aggregate and move the data into cold, read-only DynamoDB tables, with DAX etc.
I am currently working on a project that collects a customer's demographics weekly and stores the delta (from previous weeks) as a new record. This process will encompass 160 variables and a couple hundred million people (my management and a consulting firm require this, although ~100 of the variables are seemingly useless). These variables will be collected from 9 different tables in our Teradata warehouse.
I am planning to split this into 2 tables.
Table 1: often-used demographics (~60 variables sourced from 3 tables), normalized (1 customer id and add date for each demographic variable)
Table 2: rarely used or unused demographics (~100 variables sourced from 6 tables), normalized (1 customer id and add date for each demographic variable)
MVC (multi-value compression) is utilized to save as much space as possible, as the database it will live on is limited in size due to backup limitations. (Note that the customer id currently consumes 30% (3.5 GB) of table 1's size, so additional tables would add that storage cost.)
The table(s) will be accessed by finding the most recent record in relation to the date the Analyst has selected:
SELECT cus_id, demo
FROM db1.demo_test
WHERE (cus_id, add_dt) IN (
    SELECT cus_id, MAX(add_dt)
    FROM db1.demo_test
    WHERE add_dt <= '2013-03-01' -- Analyst-selected point-in-time date
    GROUP BY 1)
GROUP BY 1,2
This data will be used for modeling purposes, so a reasonable SELECT speed is acceptable.
Does this approach seem sound for storage and querying?
Is any individual table too large?
Is there a better suggested approach?
My concerns with splitting further are:
Space, due to uncompressible fields such as dates and customer ids
Speed when joining 2-3 tables (though I suspect an inner join may use very few resources)
Please excuse my ignorance in this matter. I usually work with large tables that do not persist for long (I am a Data Analyst by profession) or the tables I build for long term data collection only contain a handful of columns.
In addition to Rob's remarks:
What is your current PI/partitioning?
Is the current performance unsatisfactory?
How do the analysts access the data besides the point-in-time lookup? Are there any other common conditions?
Depending on your needs, a (prev_dt, add_dt) pair might be better than a single add_dt. It is more overhead to load, but querying might be as simple as date ... BETWEEN prev_dt AND add_dt.
A Join Index on (cus_id), (add_dt) might be helpful, too.
You might replace the MAX subquery with a RANK (MAX is usually slower; only when cus_id is the PI might RANK be worse):
SELECT *
FROM db1.demo_test
WHERE add_dt <= '2013-03-01' -- Analyst-selected point-in-time date
QUALIFY RANK() OVER (PARTITION BY cus_id ORDER BY add_dt DESC) = 1
In TD14 you might split your single table into two row containers of a column-partitioned table.
...
The width of the table at 160 columns, sparsely populated, is not necessarily an incorrect physical implementation (normalized in 3NF or slightly de-normalized). I have also seen situations where attributes that are not regularly accessed are moved to a documentation table. If you elect to implement the latter in your physical design, it would be in your best interest that each table share the same primary index. This allows the join of these two tables (60 attributes and 100 attributes) to be AMP-local on Teradata.
If access to the table(s) will also include the add_dt column, you may wish to create a partitioned primary index on this column. This will allow the optimizer to eliminate the other partitions from being scanned when the add_dt column is included in the WHERE clause of a query. Another option would be to test the behavior of a value-ordered secondary index on the add_dt column.
For one of my clients I have to import a CSV of Medicare plans provided by the government (part one provided here) into Drupal 7. There are about 500,000 rows of data in that CSV, most of which differ only by the FIPS County code field - basically, every county that a plan is available in counts as one row.
Should I import all 500k rows into Drupal 7 as individual nodes, or create a single node for every plan and put the numerous FIPS codes associated with that plan in a multi-value text field? I opted for the latter route to begin with; however, when I looked in the plan database it appears that some plans are available in more than 10,000 counties. I'd like to find the most efficient, Drupal-esque solution to storing all these plans and where they are available.
Generally it is very useful to avoid storing any duplicate data, so you are right: creating 500k rows as individual nodes is a bad idea. I would rather create two content types (using CCK):
Medicare Plan
FIPS County code (or maybe just County)
And then create a many-to-many relationship between them (using CCK Node Reference, maybe Corresponding node references for mutual relationships if needed).
You can then create a view that will list all FIPS County codes attached to a particular Medicare Plan.
I ended up going with a row per plan - as it turned out, there were subtle differences between them that I missed. Thanks to all who answered!