This is a simplified version of my problem using a DynamoDB Table. Most items in the Table represent sales across multiple countries. One of my required access patterns is to retrieve all sales in countries which belong to a certain country_grouping between a range of order_dates. The incoming stream of sales data contains the country attribute, but not the country_grouping attribute.
Another entity in the same Table is a reference table, which is infrequently updated and can be used to identify the country_grouping for each country. Can I design a GSI, or otherwise structure the table, to retrieve all sales for a given country_grouping between a range of order dates?
Here's an example of the Table structure:
| PK | SK | sale_id | order_date | country | country_grouping |
| --- | --- | --- | --- | --- | --- |
| SALE#ID#1 | ORDER_DATE#2022-06-01 | 1 | 2022-06-01 | UK | |
| SALE#ID#2 | ORDER_DATE#2022-09-01 | 2 | 2022-09-01 | France | |
| SALE#ID#3 | ORDER_DATE#2022-07-01 | 3 | 2022-07-01 | Switzerland | |
| COUNTRY_GROUPING#EU | COUNTRY#France | | | France | EU |
| COUNTRY_GROUPING#NATO | COUNTRY#UK | | | UK | NATO |
| COUNTRY_GROUPING#NATO | COUNTRY#France | | | France | NATO |
Possible solution 1
As the sales items are streamed into the Table, query the country_grouping associated with the country in the sale, and write the corresponding country_grouping to each sale record. Then create a GSI where country_grouping is the partition key and the order_date is the sort key. This seems expensive to me, consuming 1 RCU and 1 WCU per sale record imported. If country groupings changed (imagine the UK rejoins the EU), then I would run an update operation against all sales in the UK.
Possible solution 2
Have the application first query to retrieve every country in the desired country_grouping, then send an individual request for each country using a GSI where the partition key is country and the order_date is the sort key. Again, this seems less than ideal, as I consume 1 WCU per country, plus the 1 WCU to obtain the list of countries.
Is there a better way?
Picking an optimal solution depends on factors you haven't mentioned:
How quickly you need it to execute
How many sales records per country and country group you insert
How many sales records per country you expect there to be in the db at query time
How large a Sale item is
For example, if your Sale items are large and/or you insert a lot every second, you're going to need to worry about creating a hot key in the GSI. I'm going to assume your update rate is not too high, the Sale item size isn't too large, and you're going to have thousands or more Sale items per country.
If my assumptions are correct, then I'd go with Solution 2. You'll spend one read unit (it's not a WCU but rather an RCU, and it's only half a read unit if eventually consistent) to Query the country group and get a list of countries. Do one Query for each country in that group to pull all the Sale items matching the specific time range for that country. Since there are lots of matching sales, the cost is about the same. One 400 KB pull from a country_grouping PK is the same cost as four 100 KB pulls from four different country PKs. You can also do the country Query calls in parallel, if you want, to speed execution. If you're returning megabytes of data or more, this will be helpful.
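Here's a minimal boto3 sketch of that two-step read, assuming a GSI keyed on country (partition) and order_date (sort); the table name, index name, and pagination handling are illustrative assumptions, not part of the original design:

```python
# Sketch of Solution 2: one Query for the country group, then one Query per country.
# Table name and GSI name are assumptions; pagination (LastEvaluatedKey) omitted.
from concurrent.futures import ThreadPoolExecutor

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("SalesTable")  # assumed table name


def countries_in_grouping(grouping: str) -> list[str]:
    """One Query (an RCU, not a WCU) to list the countries in a grouping."""
    resp = table.query(
        KeyConditionExpression=Key("PK").eq(f"COUNTRY_GROUPING#{grouping}")
    )
    return [item["country"] for item in resp["Items"]]


def sales_for_country(country: str, start: str, end: str) -> list[dict]:
    """Query the assumed GSI: country = partition key, order_date = sort key."""
    resp = table.query(
        IndexName="country-order_date-index",  # assumed GSI name
        KeyConditionExpression=Key("country").eq(country)
        & Key("order_date").between(start, end),
    )
    return resp["Items"]


def sales_for_grouping(grouping: str, start: str, end: str) -> list[dict]:
    countries = countries_in_grouping(grouping)
    # Fan out one Query per country; run them in parallel to cut latency.
    with ThreadPoolExecutor(max_workers=len(countries) or 1) as pool:
        results = pool.map(lambda c: sales_for_country(c, start, end), countries)
    return [sale for batch in results for sale in batch]


# Example: all EU sales between two order dates.
print(sales_for_grouping("EU", "2022-06-01", "2022-09-30"))
```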
If in fact you have only a few sales per country, well, then any design will work.
Your solution 1 is probably best. The underlying issue is that the partition key defines the physical location of an item on a server (both for the base-table entry and for its GSI copy). You duplicate data, because storage is cheap, to get better query performance.
So if, as you said, the UK rejoins the EU, you won't be modifying the GSI entries in place; AWS will create a new entry in a different location because the partition key changed.
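For reference, here's what the Solution 1 read path might look like once country_grouping has been written onto each sale item; the table and index names are assumptions for illustration:

```python
# Sketch of the Solution 1 read: a single Query against an assumed GSI whose
# partition key is country_grouping and whose sort key is order_date.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("SalesTable")  # assumed table name

resp = table.query(
    IndexName="country_grouping-order_date-index",  # assumed GSI name
    KeyConditionExpression=Key("country_grouping").eq("EU")
    & Key("order_date").between("2022-06-01", "2022-09-30"),
)
sales = resp["Items"]  # pagination via LastEvaluatedKey omitted for brevity
```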
How about if you put the country_grouping in the SK of the sale?
For example COUNTRY_GROUPING#EU#ORDER_DATE#2022-07-01
Then you can do a "begins_with" query on the SK (or a between condition for an arbitrary date range) and avoid the GSI, which would otherwise consume extra capacity units.
The country-group lookup can be cached in memory to save some read units, and I wouldn't design my table around one-time events like the UK leaving. If that happens, do a full scan and update everything; it's a one-time operation, not a big deal.
Also, DynamoDB is not really designed to store items for long periods of time. Typically you would keep, for example, the last 30 days of sales, set a TTL on the items, and stream them to S3 (or BigQuery) once they expire.
I need to create a table with the following fields :
place, date, status
My keys are: partition key - place, sort key - date
Status can be either 0 or 1
The table has approximately 300k rows per day and about 3 days' worth of data at any given time, so about 1 million rows. I have a service that is continuously populating data to this DDB.
I need to run the following queries (only) once per day:
#1 Return count of all places with date = current_date-1
#2 Return count and list of all places with date= current_date-1 and status = 0
Questions :
As date is already a sort key, is query #1 bound to be quick?
Do we need to create indexes on sort key fields?
If the answer to the above question is yes: for query #2, do I need to create a GSI on date and status, with date as the partition key and status as the sort key?
Creating a GSI vs using filter expression on status for query #2. Which of the two is recommended?
Running analytical queries (such as counts) is a misuse of a NoSQL database like DynamoDB, which is designed for scalable lookup use cases.
Even if you get the Scan to work with one design or another, it will be more expensive and slower than it should be.
A better option is to export the table data from DynamoDB into S3, and then run an Athena query over that data. It will be much more flexible to run various analytical queries.
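For example, a one-off export can be kicked off from code; the table ARN and bucket name below are placeholders, and point-in-time recovery must be enabled on the table:

```python
# Sketch of exporting the table to S3 so Athena can query it.
# The ARN and bucket name are placeholders; PITR must be enabled on the table.
import boto3

client = boto3.client("dynamodb")
client.export_table_to_point_in_time(
    TableArn="arn:aws:dynamodb:us-east-1:123456789012:table/PlacesTable",
    S3Bucket="my-analytics-bucket",
    ExportFormat="DYNAMODB_JSON",
)
# Then define an Athena table over the exported data and run SQL counts there.
```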
The easiest thing for you to do is a full table scan once per day, filtering by yesterday's date, and as part of that keep your own client-side count of whether the status was 0 or 1. The filter is not index-optimized, so it will be a true full table scan.
Why not an export to S3? Because you're really just doing one query. If you follow the export route you'll have to do a new export every day to keep the data fresh, and the cost of the export in dollar terms (plus complexity) is more than a single full scan. If you were going to do repeated queries against the data, then the export makes more sense.
Why not use GSIs? They would make the table scan more efficient by minimizing what's scanned. However, there's a cost (plus complexity) in keeping them current.
Short answer: a once per day full table scan is both simple to implement and as fast as you want (parallel scan is an option), plus it's not really costly.
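A sketch of that once-a-day scan with client-side counts, assuming attributes named place, date, and status (the table name is made up):

```python
# Daily scan: count yesterday's rows and collect places with status = 0.
# The filter is applied after the read, so you still pay to scan the whole table.
from datetime import date, timedelta

import boto3
from boto3.dynamodb.conditions import Attr

table = boto3.resource("dynamodb").Table("PlacesTable")  # assumed table name
yesterday = (date.today() - timedelta(days=1)).isoformat()

total = 0
status_zero_places = []
scan_kwargs = {"FilterExpression": Attr("date").eq(yesterday)}

while True:
    resp = table.scan(**scan_kwargs)
    for item in resp["Items"]:
        total += 1                      # query #1: count for yesterday
        if item["status"] == 0:         # query #2: count + list where status = 0
            status_zero_places.append(item["place"])
    if "LastEvaluatedKey" not in resp:  # keep paging until the scan is complete
        break
    scan_kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]

print(total, len(status_zero_places), status_zero_places)
```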
How much would it cost? Million rows, 100 bytes each, so that's a 100 MB table. That's 25,000 read units to fully scan, which is halved down to 12,500 with eventual consistency. On Demand pricing is $0.25 per million read units. 12,500 / 1,000,000 * $0.25 = $0.003. Less than a dime a month. It'd be cheaper still if you run provisioned.
Just do the scan. :)
I am trying to determine the best partition key for a CosmosDB table that has both a customer ID (unique value for each customer) and customer city (in North America, which yields thousands of possible values).
Reading the Azure documentation, I see a lot of conflicting information about which one is best. Some of the documents say that the more unique value (customer ID) will provide a better spread of items across partitions, while other documents state that using city would be best.
So my question(s) are:
Is each partition key hashed, and does each partition contain items with keys in a range of hashes? I.e., if customer ID is the partition key, would one partition have IDs 1 through 1000, another partition 1001 through 2000, etc.? Same with city: would one partition have multiple cities? Or would each partition map 1:1 to a specific partition key, i.e. ID or city?
Based on the above, which one would be better (more performant, costing less)? Having as granular a partition key as possible (i.e. customer ID)? Or customer city?
Thank you!
Yes, partition keys are hashed, and those hashes determine where logical partitions are physically stored.
No, partitions will only ever contain records with the same partition key (that's basically the point: co-locating associated records). So in your example, they would be mapped 1:1.
Cost is irrelevant because you aren't charged for partitions (although they do have a size limit), so the question comes down to performance, and that depends entirely on how your application queries the data.
A good analogy for understanding how partitioning works is to think about finding someone's address:
If I gave you the key to my house (item ID) but nothing else, you would need to try every door in the world until you happened to stumble upon the right one (a.k.a. a cross-partition query). If I told you the country (partition key), you could immediately eliminate millions of doors, but you'd still have millions left to check, so that's still not very efficient. If I gave you the city, fewer again, but still a lot to check. But if I gave you my postcode, we've just optimized a query from billions of records down to 15-20.
I'm trying to better understand using the adjacency list pattern for many to many (m:n) relationship design in AWS DynamoDB.
Looking at the AWS docs here: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-adjacency-graphs.html we have an example with an Invoice and Bill entity with an m:n relationship.
I understand that I can get details of all bills associated with a particular invoice by reading a single partition. For example I can query for Invoice-92551 and know some attributes of the 2 bills that are associated with it based on the additional items in the partition.
My question is what do I have to do to get the full bill attributes for these 2 bills. Does this require 2 additional queries using the IDs I derived from the invoice partition, or is there some other pattern I am missing here?
Additional Details
Referencing the 2 different descriptions of Bill items in the screenshot:
Bill items in Invoice partitions: "Attributes of this bill in this invoice"
Bill items in their own partitions: "More attributes of this bill"
Does this mean that my Invoice partitions should include any Bill attributes I want to access via minimal queries? I was originally thinking the Bill partitions would contain most of what I want, but that doesn't quite make sense if I want to get at them by Invoice.
No, no additional queries - unless you ask for ("project") only certain attributes, your query will retrieve all the attributes of the bills together with their keys.
DynamoDB stores each partition together on a single node, so it's efficient to fetch the entire partition. This partition is defined by its "partition key" (your invoice number). The partition contains a bunch of "items" (your bills), each item has its own "sort key" (your bill ID) and any number of "attributes". When DynamoDB reads the partition, it reads those items in order, with all their attributes, and can return all of them unless you specifically asked it not to. Note that even if you ask it only to return a subset (a "projection") of these attributes, Amazon still needs to read them from disk, and you will still pay for this I/O.
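For example, a projection looks something like this (the table name and the bill_amount attribute are hypothetical); note you are still billed for the full items read from disk:

```python
# Query one invoice partition but return only a subset of attributes.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("InvoicesAndBills")  # assumed table name
resp = table.query(
    KeyConditionExpression=Key("PK").eq("Invoice-92551"),
    ProjectionExpression="SK, bill_amount",  # bill_amount is a made-up attribute
)
items = resp["Items"]
```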
You have two options: issue multiple queries or duplicate some bill data. When you query for an invoice and its bills, you'll get
More attributes of this invoice, and
Attributes of this bill in this invoice.
You will not get "More attributes of this bill" for any bills. To get those, you must query for the bills themselves. You can issue individual GetItem queries or a single BatchGetItem query with the bill IDs (limited to 100 bills per query).
Alternatively, you can duplicate some values from "More attributes of this bill" to each invoice-bill item to avoid the second query at the cost of storage and insert/update complexity.
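A rough boto3 sketch of the two-query approach, assuming the bill items in their own partitions are keyed as PK = SK = Bill-&lt;id&gt; as in the AWS docs example (the table name and key attribute names are assumptions):

```python
# Step 1: Query the invoice partition; Step 2: BatchGetItem the full bill items.
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("InvoicesAndBills")  # assumed table name

# Step 1: one Query pulls the invoice item plus the invoice-bill items
# ("more attributes of this invoice" and "attributes of this bill in this invoice").
invoice_partition = table.query(
    KeyConditionExpression=Key("PK").eq("Invoice-92551")
)["Items"]
bill_ids = [item["SK"] for item in invoice_partition if item["SK"].startswith("Bill-")]

# Step 2: BatchGetItem (up to 100 keys per call) fetches "more attributes of this bill".
# Retry handling for UnprocessedKeys omitted for brevity.
resp = dynamodb.batch_get_item(
    RequestItems={
        "InvoicesAndBills": {
            "Keys": [{"PK": bill_id, "SK": bill_id} for bill_id in bill_ids]
        }
    }
)
bills = resp["Responses"]["InvoicesAndBills"]
```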
I am developing an application that allows users to read books. I am using DynamoDB for storing details of the books that user reads and I plan to use the data stored in DynamoDB for calculating statistics, such as trending books, authors, etc.
My current schema looks like this:
user_id | timestamp | book_id | author_id
user_id is the partition key, and timestamp is the sort key.
The problem I am having is that, with this schema, I am only able to query the details of the books that a single user (partition key) has read. That is one of my requirements.
The other requirement is to query all the records that have been created in a certain date range, e.g. records created in the past 7 days. With this schema, I am unable to run this query.
I have looked into so many other options, and haven't figured out a way to create a schema that would allow me to run both queries.
Retrieve the records of the books read by a single user (can be done).
Retrieve the records of the books read by all users in the last x days (unable to do it).
I do not want to run a scan, since it would be expensive. I looked into using a GSI on the timestamp, but a Query requires me to specify a hash key, and therefore I cannot query all the records created between two dates.
One naive solution would be to create a GSI with a constant hash key across all records and the timestamp as the range key. This would allow you to perform your type of query.
The problem with this approach is that it is likely to become a scaling bottleneck, as the same hash key means the same node. One workaround is sharding: create a set of hash keys (e.g. 1 to 10) and assign a random key from this set to every record. Then, when you query, you make 10 queries (one per shard) and merge the results. You can even make the set size dynamic, so that it scales with your data.
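A rough sketch of that sharding workaround, assuming a GSI whose partition key is a synthetic shard attribute and whose sort key is the timestamp (the table, index, and attribute names are illustrative):

```python
# Write-sharded GSI: writes pick a random shard; reads scatter-gather across shards.
import random
from concurrent.futures import ThreadPoolExecutor

import boto3
from boto3.dynamodb.conditions import Key

NUM_SHARDS = 10
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("ReadingHistory")  # assumed table name


def record_read(user_id: str, timestamp: str, book_id: str, author_id: str) -> None:
    # Assign a random shard so writes spread across NUM_SHARDS GSI partitions.
    table.put_item(
        Item={
            "user_id": user_id,
            "timestamp": timestamp,
            "book_id": book_id,
            "author_id": author_id,
            "shard": random.randint(1, NUM_SHARDS),
        }
    )


def reads_between(start: str, end: str) -> list[dict]:
    # Scatter-gather: one Query per shard, then merge the results.
    def query_shard(shard: int) -> list[dict]:
        resp = table.query(
            IndexName="shard-timestamp-index",  # assumed GSI name
            KeyConditionExpression=Key("shard").eq(shard)
            & Key("timestamp").between(start, end),
        )
        return resp["Items"]

    with ThreadPoolExecutor(max_workers=NUM_SHARDS) as pool:
        results = pool.map(query_shard, range(1, NUM_SHARDS + 1))
    return [item for batch in results for item in batch]
```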
I would also suggest looking into other tools (not DynamoDB) for this use case, as DDB is not the best tool for data analysis. You might, for example, feed DynamoDB data into CloudSearch or ElasticSearch and do your analysis there.
One solution could be a GSI with two extra attributes: whenever you ingest a record, write the date (e.g. 2017-07-02) as the GSI partition key and the timestamp (e.g. 04:22:33.000) as the range key.
Maintain a checkpoint table containing the process name and the timestamp of the last read. Every time you read from the main table, update the checkpoint table so you can fetch incremental data. If you want the last 7 days of data, set the timestamp back 7 days and read everything between then and the current time.
You can Query this by passing the date as the partition key and using the between keyword on the timestamp, which is the range condition.
Calculate the date difference between the checkpoint and the current date, and query day by day to collect the data.
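A sketch of that day-by-day query over the date-partitioned GSI, with assumed table, index, and attribute names:

```python
# Query the last 7 days, one day (= one GSI partition) at a time, and merge.
from datetime import date, timedelta

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("ReadingHistory")  # assumed table name


def reads_last_n_days(n: int = 7) -> list[dict]:
    items = []
    for offset in range(n):
        day = (date.today() - timedelta(days=offset)).isoformat()
        resp = table.query(
            IndexName="date-timestamp-index",  # assumed GSI name
            # Within a single day, the timestamp sort key can be narrowed further
            # with Key("timestamp").between(start, end) if needed.
            KeyConditionExpression=Key("date").eq(day),
        )
        items.extend(resp["Items"])
    return items
```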
Let's assume that I have a table with the following attributes:
UNIQUE user_id (primary hash key)
category_id (GSI hash index)
timestamp
I will have a lot of users, but only a few categories.
user_id | category_id
1 | 1
3 | 1
4 | 1
5 | 3
.. | ..
50000000 | 1
Is it ok to store millions of records with the same category_id value as a Global Secondary Index? Should I expect any restrictions?
I'm wondering whether a scan might actually be a reasonable choice. I will filter by category_id only once a day. What is the cost (time and money) of scanning millions of records?
Thanks!
According to the Limits documentation, the only limitation is:
No practical limit for tables without local secondary indexes.
For a table with local secondary indexes, there is a limit on item collection sizes: For every distinct hash key value, the total sizes of all table and index items cannot exceed 10 GB. Depending on your item sizes, this may constrain the number of range keys per hash value. For more information, see Item Collection Size Limit.
Now for your second question of whether you should be doing a Query or a Scan: you asked about both performance and monetary cost. Maintaining a GSI is expensive, because you have to pay for its throughput (and, if I recall correctly, also its storage), so it's like paying for another table; plus, it's another table whose throughput you have to monitor to make sure you aren't being throttled. On the other hand, the performance is much better.
If you're planning on going through all categories once a day (which means every document in the table), then Scan is the way to go. You aren't gaining anything from Querying. Plus, it's cheaper (no extra GSI), and you don't have to worry about projections.