DynamoDB - Global Secondary Index on set items - amazon-dynamodb

I have a dynamo table with the following attributes :
id (Number - primary key )
title (String)
created_at (Number - long)
tags (StringSet - contains a set of tags say android, ios, etc.,)
I want to be able to query by tags - get me all the items tagged android. How can I do that in DynamoDB? It appears that global secondary index can be built only on ScalarDataTypes (which is Number and String) and not on items inside a set.
If the approach I am taking is wrong, an alternative way for doing it either by creating different tables or changing the attributes is also fine.

DynamoDB is not designed to optimize indexing on set values. Below is a copy of the amazon's relevant documentation (from Improving Data Access with Secondary Indexes in DynamoDB).
The key schema for the index. Every attribute in the index key schema
must be a top-level attribute of type String, Number, or Binary.
Nested attributes and multi-valued sets are not allowed. Other
requirements for the key schema depend on the type of index: For a
global secondary index, the hash attribute can be any scalar table
attribute. A range attribute is optional, and it too can be any scalar
table attribute. For a local secondary index, the hash attribute must
be the same as the table's hash attribute, and the range attribute
must be a non-key table attribute.
Amazon recommends creating a separate one-to-many table for these kind of problems. More info here : Use one to many tables

This is a really old post, sorry to revive it, but I'd take a look at "Single Table Design"
Basically, stop thinking about your data as structured data - embrace denormalization
id (Number - primary key )
title (String)
created_at (Number - long)
tags (StringSet - contains a set of tags say android, ios, etc.,)
Instead of a nosql table with a "header" of this:
id|title|created_at|tags
think of it like this:
pk|sk |data....
id|id |{title, created_at}
id|id+tag|{id, tag} <- create one record per tag
You can still return everything by querying for pk=id & sk begins with id and join the tags to the id records in your app logic
and you can use a GSI to project id|id+tag into tag|id which will still require you to write two queries against your data to get items of a given tag (get the ids then get the items), but you won't have to duplicate your data, you wont have to scan and you'll still be able to get your items in one query when your access pattern doesn't rely on tags.
FWIW I'd start by thinking about all of your access patterns, and from there think about how you can structure composite keys and/or GSIs
cheers

You will need to create a separate table for this query.
If you are interested in fetching all items based on a tag then I suggest keeping a table with a primary key:
hash: tag
range: id
This way you can use a very simple Query to fetch all items by tag.

Related

How to filter DynamoDb by object property value

I have a DynamoDB table:
How shoul I filter entried in DB table where all keys are: access.role = "ADMIN"?
You would be best served by setting up an Global Index (GSI). You set the Partition Key equal to that attribute, and the Sort Key equal to some other attribute that you can guarantee will be unique. Then you use your SDK of choice or the Query option in the console, select the index, and query for partion_key = ADMIN
However. Be aware. Index's are a complete replication of the table. Dynamo is very good at this and relatively fast at doing so, but there is still the possibility that your index will be out of sync with the actual data. If you are not making the call against the index very often you are pretty much fine. If you are calling it very often, then you should restructure your table.
Dynamo is not an SQL. When setting up a dynamo schema you have to consider how you will access your data. your Access Patterns. You should design your data with your Partition Key as the data you will have when looking up (Ie: i always will have a user ID number) and your sort keys as the individual documents related to that PK (ie: a user has a document that is his profile data, a document that is his profile picture url, a document that is a list of his friends user numbers, a document that is ... ect)
Then you use Indexs for things like your question that you wont be doing very often.

Querying on Global Secondary indexes with a usage of contains operator

I've been reading a DynamoDB docs and was unable to understand if it does make sense to query on Global Secondary Index with a usage of 'contains' operator.
My problem is as follows: my dynamoDB document has a list of embedded objects, every object has a 'code' field which is unique:
{
"entities":[
{"code":"entity1Code", "name":"entity1Name"},
{"code":"entity2Code", "name":"entity2Name"}
]
}
I want to be able to get all documents that contain entities with entity.code = X.
For this purpose I'm considering adding a Global Secondary Index that would contain all entity.codes that are present in current db document separated by a comma. So the example above would look like:
{
"entities":[
{"code":"entity1Code", "name":"entity1Name"},
{"code":"entity2Code", "name":"entity2Name"}
],
"entitiesGlobalSecondaryIndex":"entityCode1,entityCode2"
}
And then I would like to apply filter expression on entitiesGlobalSecondaryIndex something like: entitiesGlobalSecondaryIndex contains entityCode1.
Would this be efficient or using global secondary index does not make sense in this way and DynamoDB will simply check the condition against every document which is similar so scan?
Any help is very appreciated,
Thanks
The contains operator of a query cannot be run on a partition Key. In order for a query to use any sort of operators (contains, begins with, > < ect...) you must have a range attributes- aka your Sort Key.
You can very well set up a GSI with some value as your PK and this code as your SK. However, GSIs are replication of the table - there is a slight potential for the data ina GSI to lag behind that of the master copy. If the query you're doing against this GSI isn't very often, then you're probably safe from that.
However. If you are trying to do this to the entire table at once then it's no better than a scan.
If what you need is a specific Code to return all its documents at once, then you could do a GSI with that as the PK. If you add a date field as the SK of this GSI it would even be time sorted. If you query against that code in that index, you'll get every single one of them.
Since you may have multiple codes, if they aren't too many per document, you maybe could use a Sparse Index - if you have an entity with code "AAAA" then you also have an attribute named AAAA (or AAAAflag or something.) It is always null/does not exist Unless the entities contains that code. If you do a GSI on this AAAflag attribute, it will only contain documents that contain that entity code, and ignore all where this attribute does not exist on a given document. This may work for you if you can also provide a good PK on this to keep the numbers well partitioned and if you don't have too many codes.
Filter expressions by the way are different than all of the above. Filter expressions are run on tbe data that would be returned, after it is already read out of the table. This is useful I'd you have a multi access pattern setup, but don't want a particular call to get all the documents associated with a particular PK - in the interests of keeping the data your code is working with concise. The query with a filter expression still retrieves everything from that query, but only presents what makes it past the filter.
If are only querying against a particular PK at any given time and you want to know if it contains any entities of x, then a Filter expressions would work perfectly. Of course, this is only per PK and not for your entire table.
If all you need is numbers, then you could do a count attribute on the document, or a meta document on that partition that contains these values and could be queried directly.
Lastly, and I have no idea if this would work or not, if your entities attribute is a map type you might very well be able to filter against entities code - and maybe even with entities.code.contains(value) if it was an SK - but I do not know if this is possible or not

DynamoDB sub item filter using .Net Core API

First of all, I have table structure like this,
Users:{
UserId
Name
Email
SubTable1:[{
Column-111
Column-112
},
{
Column-121
Column-122
}]
SubTable2:[{
Column-211
Column-212
},
{
Column-221
Column-222
}]
}
As I am new to DynamoDB, so I have couple of questions regarding this as follows:
1. Can I create structure like this?
2. Can we set primary key for subtables?
3. Luckily, I found DynamoDB helper class to do some operations into my DB.
https://www.gopiportal.in/2018/12/aws-dynamodb-helper-class-c-and-net-core.html
But, don't know how to fetch only perticular subtable
4. Can we fetch only specific columns from my main table? Also need suggestion for subtables
Note: I am using .net core c# language to communicate with DynamoDB.
Can I create structure like this?
Yes
Can we set primary key for subtables?
No, hash key can be set on top level scalar attributes only (String, Number etc.)
Luckily, I found DynamoDB helper class to do some operations into my DB.
https://www.gopiportal.in/2018/12/aws-dynamodb-helper-class-c-and-net-core.html
But, don't know how to fetch only perticular subtable
When you say subtables, I assume that you are referring to Array datatype in the above sample table. In order to fetch the data from DynamoDB table, you need hash key to use Query API. If you don't have hash key, you can use Scan API which scans the entire table. The Scan API is a costly operation.
GSI (Global Secondary Index) can be created to avoid scan operation. However, it can be created on scalar attributes only. GSI can't be created on Array attribute.
Other option is to redesign the table accordingly to match your Query Access Pattern.
Can we fetch only specific columns from my main table? Also need suggestion for subtables
Yes, you can fetch specific columns using ProjectionExpression. This way you get only the required attributes in the result set

DynamoDB data model secondary index search

Folks,
Given we have to store the following shopping cart data:
userID1 ['itemID1','itemID2','itemID3']
userID2 ['itemID3','itemID2','itemID7']
userID3 ['itemID3','itemID2','itemID1']
We need to run the following queries:
Give me all items (which is a list) for a specific user (easy).
Give me all users which have itemID3 (precisely my question).
How would you model this in DynamoDB?
Option 1, only have the Hash key? ie
HashKey(users) cartItems
userID1 ['itemID1','itemID2','itemID3']
userID2 ['itemID3','itemID2','itemID7']
userID3 ['itemID3','itemID2','itemID1']
Option 2, Hash and Range keys?
HashKey(users) RangeKey(cartItems)
userID1 ['itemID1','itemID2','itemID3']
userID2 ['itemID3','itemID2','itemID7']
userID3 ['itemID3','itemID2','itemID1']
But it seems that range keys can only be strings, numbers, or binary...
Should this be solved by having 2 tables? How would you model them?
Thanks!
Rule 1: The range keys in DynamoDB table must be scalar, and that's why the type must be strings, numbers, boolean or binaries. You can't take a list, set, or a map type.
Rule 2: You cannot (currently) create a secondary index off of a nested attribute. From the Improving Data Access with Secondary Indexes in DynamoDB documentation. That means, you can not index the cartItems since it's not a top level JSON attribute. You may need another table for this.
So, the simple answer to your question is another question: how do you use your data?
If you query the users with input item (say itemID3 in your case) infrequently, perhaps a Scan operation with filter expression will work just fine. To model your data, you may use the user id as the HASH key and cartItems as the string set (SS type) attribute. For queries, you need to provide a filter expression for the Scan operation like this:
contains(cartItems, :expectedItem)
and, provide the value itemID3 for the placeholder :expectedItem in parameter valueMap.
If you run many queries like this frequently, perhaps you can create another table taking the item id as the HASH key, and set of users having that item as the string set attribute. In this case, the 2nd query in your question turns out to be the 1st query in the other table.
Be aware of that, you need to maintain the data at two tables for each CRUD action, which may be trivial with DynamoDB Streams.

Does dynamodb support something like an "in" clause in its queries?

Say I have table of photos and users.
Given I have a list of users I'm following [user1,user2,...] and I want to get a list of photos of people I'm following.
How can I query the table of photos where photo.createdBy in [user1,user2,user3...]
I saw that dynamodb has a batch operation, but that takes a primary key, and in this case we would be querying against a secondary index (createdBy).
Is there a way to do a query like this in dynamodb?
If you are querying purely on photo.createdBy, then you should create a global secondary index:
To speed up queries on non-key attributes, you can create a global secondary index. A global secondary index contains a selection of attributes from the table, but they are organized by a primary key that is different from that of the table. The index key does not need to have any of the key attributes from the table; it doesn't even need to have the same key schema as a table.
This will, of course, only retrieve one item. To limit results when returning more items, use a FilterExpression:
With a Query or a Scan operation, you can provide an optional filter expression to refine the results returned to you. A filter expression lets you apply conditions to the data after it is queried or scanned, but before it is returned to you. Only the items that meet your conditions are returned.
This can be applied to a Filter or Scan, but be careful of using too many Read Capacity Units when scanning for matching entries.

Resources