In a dynamo table I would like to query by selecting all items where an attributes value matches one of a set of values. For example my table has a current_status attribute so I would like all items that either have a 'NEW' or 'ASSIGNED' value.
If I apply a GSI to the current_status attribute it looks like I have to do this in two queries? Or instead do a scan?
DynamoDB does not recommend using scan. Use it only when there is no other option and you have fairly small amount of data.
You need use GSIs here. Putting current_status in PK of GSI would result in hot
partition issue.
The right solution is to put random number in PK of GSI, ranging from 0..N, where N is number of partitions. And put the status in SK of GSI, along with timestamp or some unique information to keep PK-SK pair unique. So when you want to query based on current_status, execute N queries in parallel with PK ranging from 0..N and SK begins_with current_status. N should be decided based on amount of data you have. If the data on each row is less than 4kb, then this parallel query operation would consume N read units without hot partition issue. Below link provides the details information on this
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-indexes-gsi-sharding.html
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-modeling-nosql-B.html
Related
I am new to dynamodb & was having some trouble in finding a way to randomly getting items without a full table scan ,most of the algorithms that i found consist of full table scans
I am also taking the case where we don’t have additional information of the table(Like columns and column Type such info is unknown)
Is there a way exist to do so
You can randomly sample by using a randomly generated exclusive start key for the scan or query operation. The exclusive start key does not have to match a record in the table. It just needs to follow the key structure of the table/index.
As with most questions about queries in DynamoDB, how you structure your data depends on how you want to query it.
For something like a random sampling, you have to make it confirm to the following core constraint of DynamoDB:
You have to provide a partition key
You can provide a sort key
So with a "single table" type design, you could structure your data something like this:
PK
SK
myVal
my_dict
6caaf1e3-eb8d-404a-a2ae-97d6682b0224
foo
my_dict
1c5496e8-c660-4b4e-980f-4abfb1942863
bar
my_dict
56551340-fff8-4824-a5be-70fcaece2e1a
baz
my_other_dict
520a7b37-233c-49dd-87da-77d871d98c92
test1
my_other_dict
65ccd54e-72c3-499d-a3a7-0cd989252607
test2
The PK is the identifier for your collection of random things to look up. The SK is a random UUID. And myVal contains the value you want to be returned.
You can query this db the following way:
SELECT * FROM "my-table" WHERE PK = 'my_dict' AND SK < '06a04e20-b239-48f2-a205-552eb61fef35'
By querying with an UUID as the SK, you'll get the first item in the table with an UUID close to the one you query for. By using a random uuid each time you query, you'll get a random result back.
The particular query above actually returns nothing, so you need to retry until you get a result.
Also, I haven't done the math (who has?), but I'd imagine that periodic queries like this won't generate perfectly random distributions, especially for small data sets.
Currently I use table.query to get items by matching partition key and sorted by sorting key. Now the new requirement is to handle batch query - a couple of hundred partition keys match and hopefully still sorted by sorting key in each partition key result. I find GetBatchItem that can handle up to 100 items per one query, but look like no sorting. Is one item here one row in DDB or all rows in one partition key?
From performance(query speed) and price perspective which one should I use? And do i have to do sorting for the result by myself if I use GetBatchItem? Ideally I like a solution of fast, cost effective and result sorted by sorting key in each partition key, but the first two are top priority and I can do sorting if I have to. Thanks
Query() is cheaper...
BatchGetItem() runs as individual GetItem() each costing 1 RCU (assuming your item is less than 400K).
Lets say you're item is 10K, Query() can return 40 of them for 1 RCU whereas returning 40 via BatchGetItem() will cost 40 RCU.
Say if I had a DynamoDB table:
UserId: S
BookName: S
BorrowedTimestamp: S
HasReturned: B
UserId (partition) and BookName (range) would be keys on the base table.
However I want to query using the other non-key fields e.g. BorrowedTimestamp > 3days and HasReturned is false.
I think I'd need to setup a GSI for this query to work, but it doesn't sound right having a binary field, HasReturned, as the partition key (with BorrowedTimestamp as range key). Is that correct, or am I missing something?
No, you don't need a GSI, but it might be more efficient depending on your circumstances.
Lets take your example of BorrowedTimestamp > 3days. Im going to assume this is for a particular user, so you have a userid to query.
You could do a query with a KeyConditionExpression of userid, then a FilterExpression of BorrowedTimestamp > 3days. Lets say the user has 10 books and 2 of them have a BorrowedTimestamp > 3days. This query will cost you 10 RCU (Read Capacity Units). That's because a FilterExpression just filters out items in your result set - DynamoDB actually found all 10 items in the query.
Now lets say you have a GSI where the partition key was userid and the range key was BorrowedTimestamp. Your KeyConditionExpression could specify both the parition key of the userid and the range key of BorrowedTimestamp > 3days. The result would be exactly the same. However this time it would only cost you 2 RCUs, and those RCUs would come from the index capacity not the table capacity.
Less RCUs sounds good, but remember you have to purchase throughput capacity for your primary index and GSI separately. This can be less efficient because you can't share purchased throughput between queries that use your primary key and GSI.
Finally if you didn't want to specify a userid at all you would use a scan. Scans sometimes don't scale well because they always evaluate every item in the table, but whether it works for you really depends on a lot of things (like how often you will use the scan, how many items you will have in the table etc).
I'm new to AWS DynamoDB and wanted to clarify something. Is it possible to query a table and filter base on a non-primary key attribute. My table looks like the following
Store
Id: PrimaryKey
Name: simple string
Location: simple string
Now I want to query on the Name, but I think I have to give the key as well from what I know? Apart from that I can use the scan but then I will be loading all the data.
From the docs:
The Query operation finds items based on primary key values. You can query any table or secondary index that has a composite primary key (a partition key and a sort key).
DynamoDB requires queries to always use the partition key.
In your case your options are:
create a Global Secondary Index that uses Name as a primary key
use a Scan + Filter if the table is relatively small, or if you expect the result set will include the majority of the records in the table
There are few designs principals that you can follow while you are using DynamoDB. If you are coming from a relational background, you have already witnessed the query limitations from primary key attributes.
Design your tables, for querying and separating hot and cold data.
Create Indexes for Querying from Non Key attributes (You have two options, Global Secondary Index which you can define at any time and Local Secondary Index which you need to specify at table creation time).
With the Global Secondary Index you can promote any NonKey attribute as the Partition Key for the Index and select another attribute for Sort Key for querying. For Local Secondary Index, you can promote any Non Key attribute as the Sort Key keeping the same Partition Key.
Using Indexes for query is important also to improve the efficiency in using provisioned throughput.
Although having indexes consumes the read throughput from the table, it also saves read through put from in a way that, if you project the right amount of attributes to read, it can give a huge benefit in reading. Check the following example.
Lets say you have a DynamoDB table that has items of 40KB. If you read directly from the table to list 10 items, it consumes 100 Read Throughput Units (For one item 10 Units since one unit can read 4KB and multiply it by 10). If you have an index defined just to project the attributes needed to list which will be having 4KB per item, then it will be consuming only 10 Read Throughput Units(One Unit per item) which makes a huge difference in terms of cost.
With DynamoDB its really important how you define Indexes to optimize for Querying not only from Query capability but also in terms of throughput.
You can not query based non-primary key attribute in Dynamo Db.
If you wanted to still do that you can do it using scan query,but scan is costly operation in DyanmoDB and if table is large, then it will affect performance and not recommended because it will scan each item in table and AWS cost you for all item it scan for that query.
There are two ways to achieve it
Keep Store Id as your PrimaryKey/ Partaion key of Dyanmo DB table and add Name/Location as sort Key (only one as Dyanmo DB accept only one Attribute as sort key by design.
Create Global Secondary Indexes for Querying from Non Key attributes which you are more frequenly required.
There are 3 ways to created GSI in Dyanamo DB, In your case select GSI with option INCLUDE and add Name , Location and store ID in Idex.
KEYS_ONLY – Each item in the index consists only of the table partition key and sort key values, plus the index key values. The KEYS_ONLY option results in the smallest possible secondary index.
INCLUDE – In addition to the attributes described in KEYS_ONLY, the secondary index will include other non-key attributes that you specify.
ALL – The secondary index includes all of the attributes from the source table. Because all of the table data is duplicated in the index, an ALL projection results in the largest possible secondary index.
I'm new to DynamoDB - I already have an application where the data gets inserted, but I'm getting stuck on extracting the data.
Requirement:
There must be a unique table per customer
Insert documents into the table (each doc has a unique ID and a timestamp)
Get X number of documents based on timestamp (ordered ascending)
Delete individual documents based on unique ID
So far I have created a table with composite key (S:id, N:timestamp). However when I come to query it, I realise that since my id is unique, because I can't do a wildcard search on ID I won't be able to extract a range of items...
So, how should I design my table to satisfy this scenario?
Edit: Here's what I'm thinking:
Primary index will be composite: (s:customer_id, n:timestamp) where customer ID will be the same within a table. This will enable me to extact data based on time range.
Secondary index will be hash (s: unique_doc_id) whereby I will be able to delete items using this index.
Does this sound like the correct solution? Thank you in advance.
You can satisfy the requirements like this:
Your primary key will be h:customer_id and r:unique_id. This makes sure all the elements in the table have different keys.
You will also have an attribute for timestamp and will have a Local Secondary Index on it.
You will use the LSI to do requirement 3 and batchWrite API call to do batch delete for requirement 4.
This solution doesn't require (1) - all the customers can stay in the same table (Heads up - There is a limit-before-contact-us of 256 tables per account)