I have a table like this:
Transports
id (PK) | createDt | shipperId | carrierId | consigneeId
1 | 23 | contact3 | contact2 | contact1
2 | 24 | contact1 | contact2 | contact3
3 | 28 | contact3 | contact2 | contact4
My access pattern is:
find all transports where a contact was either shipper, carrier or consignee sorted by createDt. E.g. entering contact1 should return records 1, 2.
How can I do this in DynamoDB?
I thought about creating a GSI. But then I need to create a separate GSI for each column, which would mean I need to join the query results on the columns myself. Perhaps there is an easier way.
I'd create a GSI on the table and split your single record up into multiple ones.
That would make writes slightly more complex, because you write multiple entities, but I'd do something like this:
PK | SK | type | GSI1PK | GSI1SK | other attributes
TRANSP#1 | TRANSP#1 | transport | | | createDt, (shipperId, carrierId, consigneeId)...
TRANSP#1 | CONTACT#SHIP | shipper-contact | CONTACT#contact3 | TRANSP#1#SHIP | ...
TRANSP#1 | CONTACT#CARR | carrier-contact | CONTACT#contact2 | TRANSP#1#CARR | ...
TRANSP#1 | CONTACT#CONS | consignee-contact | CONTACT#contact1 | TRANSP#1#CONS | ...
To get all information about a given Transport ID you do a query with PK=TRANSP#<id>
To get just the basic information about a given Transport, you can do a GetItem on PK=TRANSP#<id> and SK=TRANSP#<id> (You could also duplicate the contact infos here if they're fairly static.)
To get all transports a contact is involved in, you query GSI1 with GSI1PK=CONTACT#<id> and GSI1SK starting with TRANSP
If you really need server-side sorting, you might choose a different GSI1SK, maybe prefix it with the dt value, but I'd probably just do that client side.
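As a rough sketch of the item layout above, the following Python builds the four items for each transport and mimics the GSI1 lookup in memory. The function names are hypothetical, and copying createDt onto the contact items is an assumption made here so the results can be sorted client-side; in a real table the dicts would be written with something like boto3's batch_writer and read back with a Query on GSI1.

```python
# Role code -> (source attribute, item type), following the table above.
ROLES = {
    "SHIP": ("shipperId", "shipper-contact"),
    "CARR": ("carrierId", "carrier-contact"),
    "CONS": ("consigneeId", "consignee-contact"),
}


def build_transport_items(transport):
    """Split one transport record into a main item plus one item per contact role."""
    pk = f"TRANSP#{transport['id']}"
    main = {"PK": pk, "SK": pk, "type": "transport", "createDt": transport["createDt"]}
    for attr, _ in ROLES.values():
        main[attr] = transport[attr]
    items = [main]
    for role, (attr, item_type) in ROLES.items():
        items.append({
            "PK": pk,
            "SK": f"CONTACT#{role}",
            "type": item_type,
            "GSI1PK": f"CONTACT#{transport[attr]}",
            "GSI1SK": f"{pk}#{role}",
            # Assumption: createDt is duplicated here to allow client-side sorting.
            "createDt": transport["createDt"],
        })
    return items


def transports_for_contact(all_items, contact_id):
    """In-memory stand-in for: query GSI1, GSI1PK = CONTACT#<id>, GSI1SK begins with TRANSP."""
    hits = [
        i for i in all_items
        if i.get("GSI1PK") == f"CONTACT#{contact_id}" and i["GSI1SK"].startswith("TRANSP")
    ]
    # A contact may hold several roles on one transport, so de-duplicate by PK.
    unique = {i["PK"]: i for i in hits}
    return [i["PK"] for i in sorted(unique.values(), key=lambda i: i["createDt"])]


transports = [
    {"id": 1, "createDt": 23, "shipperId": "contact3", "carrierId": "contact2", "consigneeId": "contact1"},
    {"id": 2, "createDt": 24, "shipperId": "contact1", "carrierId": "contact2", "consigneeId": "contact3"},
    {"id": 3, "createDt": 28, "shipperId": "contact3", "carrierId": "contact2", "consigneeId": "contact4"},
]
all_items = [item for t in transports for item in build_transport_items(t)]
print(transports_for_contact(all_items, "contact1"))  # ['TRANSP#1', 'TRANSP#2']
```

The de-duplication step matters because the same contact could be, say, both shipper and consignee of one transport, which would yield two index entries for it.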
I have a requirement to find all users in a table that have same Id, Email or Phone.
Right now the data looks like this:
Id //hash
Market //sort
Email //gsi
Phone //gsi
I want to be able to do a query and say:
Get all items that have matching Id, email or phone.
From the docs it seems that you can only do a single query based on keys or one index. And it seems that even if I was to combine phone and email into one column and GSI that column I would still be limited to a begin with filter expression, is this correct? Are there any alternatives?
it seems that you can only do a single query based on keys or one index
Yes.
if I was to combine phone and email into one [GSI] I would still be limited to a begin with filter expression, is this correct?
Essentially, yes. Query constraints apply equally to indexes and the table keys. You must specify one-and-only-one Partition Key value, and optionally a range of Sort Key values.
Are there any alternatives?
Overload the Partition Key and denormalise the data. Redefine the Partition Key column (renamed PK) to hold Id, Email and Phone values. Each record is (fully or partially) repeated 3 times, each time with a different PK type.
PK | Market | Id | More fields
Id-1 | A | Id-1 | foo
zaphod@42.com | A | Id-1 | foo or blank
13015552572 | A | Id-1 | foo or blank
Querying PK = <something> AND Market > "" will return any matching id, email or phone number value.
If justified by your query patterns, repeat all fields 3x. Alternatively, use a hit on a truncated email/phone record to identify the Id, then query other fields using the Id.
There are different flavours of this pattern. For instance, you could also overload the Sort Key column (renamed to SK) with the Id value for Email and Phone records, which would permit multiple Ids per email/phone.
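A minimal sketch of the overloaded-PK idea, in plain Python with an in-memory list standing in for the table. The function names and the sample attribute values are hypothetical; the slim email/phone items carry only Market and Id, which corresponds to the "truncated record" variant described above.

```python
def build_user_items(user):
    """One full item keyed by Id, plus one slim lookup item per email/phone value."""
    full = {"PK": user["Id"], **user}
    slim = {"Market": user["Market"], "Id": user["Id"]}
    lookups = [{"PK": user[key], **slim} for key in ("Email", "Phone") if user.get(key)]
    return [full] + lookups


def find_ids(items, value):
    """In-memory stand-in for: Query PK = <value> AND Market > ""."""
    return sorted({i["Id"] for i in items if i["PK"] == value and i["Market"] > ""})


user = {"Id": "Id-1", "Market": "A", "Email": "zaphod@42.com", "Phone": "13015552572", "foo": "bar"}
table = build_user_items(user)

# The same lookup works whether the caller has an id, an email or a phone number.
for value in ("Id-1", "zaphod@42.com", "13015552572"):
    print(value, "->", find_ids(table, value))
```

With PK plus Market as the key, two users sharing an email in the same market would overwrite each other, which is the case the SK-overloading flavour is meant to handle.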
I have a relationship whereby each SITE can have one or more CAMERAs.
So the parent-child relationship would be that of SITE->CAMERA[s].
99% of my queries will be "Give me all the cameras at a given site" and "Give me camera XYZ" and "Give me all cameras where enabled===true" -- at roughly a 1:1:1 ratio.
The DynamoDB design, if I understand correctly, would be to have the partition key be 'SITE_ID' and the sort key be 'CAMERA_ID'. Done and done.
....
However, not every CAMERA belongs to a SITE. About 10% of my CAMERAs are not associated with a SITE. I could just put 'noSite' or something as the partition key, but that seems like a kludge... or is it?
I'm new to DynamoDB and unsure how best to set up this relationship. I've always just used MongoDB and never spent time in the SQL world, so needing to worry about indexes isn't something I have experience with. Cost is more important than raw speed and the DB will remain somewhat small (currently around 500 cameras and likely never more than 10k).
What is the best way to set up this table?
Detailed question first: a noSite key is not a bad design choice for unassigned cameras. SiteID is important and the key cannot be blank.
Your access patterns give you flexibility. Your low data volumes reduce the stakes of the design decisions.
What are the Partition Key and Sort Key names? Regardless of which "columns" you end up selecting for the keys, naming the keys PK and SK gives you the option to add other record types in a single-table design later. This is a common practice.
What are the PK and SK columns?
You have two good options for PK and SK for your Camera records:
# Option 1 - marginally better, CameraID has the higher cardinality
PK: CameraID, SK: SiteID
# Option 2
PK: SiteID, SK: CameraID
At this point, 1 of your "queries" will be executed as a query (faster and cheaper) and the other 2 as scans (slower and more expensive). Scanning 500 records is nothing, though, so you could be "done and done" as you say.
Sooner, Later or Never
If required, we can remove the scan operations by adding secondary indexes. Secondary indexes add storage cost (records are literally duplicated) but reduce access costs. The net cost change depends on the case. Performance will improve.
# Add an index to query "Give me all the cameras at a given site"
GSI1PK: SiteID, GSI1SK: CameraID # reverse your choice for primary keys
# or, to get fancy and be able to query enabled cameras by site, too, use a concatenated SK with a begins_with query
GSI1PK: SiteID, GSI1SK: Enabled#True#CameraID
# Add an index to query "Give me all cameras where enabled===true"
# Concatenate SiteID and CameraID in the GSI Sort Key to enable 2 types of queries
# 1. all enabled cameras? GSI2PK = true and GSI2SK > ""
# 2. all enabled cameras at Site123? GSI2PK = true and GSI2SK begins_with("Site123")
GSI2PK: Enabled, GSI2SK: SiteID#CameraID
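To make the key layout concrete, here is a small sketch in plain Python that builds a camera item using Option 1 plus both indexes, and imitates the two GSI2 queries in memory. The helper names are hypothetical. Two assumptions are made here: the enabled flag is stored as the string "true"/"false" in GSI2PK, since DynamoDB key attributes must be strings, numbers or binary, and the begins_with prefix includes the trailing "#" so that Site123 does not also match Site1234.

```python
NO_SITE = "noSite"  # placeholder key value for cameras without a site


def camera_item(camera_id, site_id=None, enabled=True):
    """Build one camera item: Option 1 primary keys plus GSI1 and GSI2 attributes."""
    site = site_id or NO_SITE
    return {
        "PK": camera_id,
        "SK": site,
        "enabled": enabled,
        "GSI1PK": site,
        "GSI1SK": camera_id,
        "GSI2PK": "true" if enabled else "false",  # key attributes cannot be booleans
        "GSI2SK": f"{site}#{camera_id}",
    }


def enabled_cameras(items, site_id=None):
    """In-memory stand-in for: GSI2PK = "true" AND begins_with(GSI2SK, "<site>#")."""
    prefix = f"{site_id}#" if site_id else ""
    hits = [i for i in items if i["GSI2PK"] == "true" and i["GSI2SK"].startswith(prefix)]
    return [i["PK"] for i in sorted(hits, key=lambda i: i["GSI2SK"])]


cameras = [
    camera_item("cam1", "Site123"),
    camera_item("cam2", "Site123", enabled=False),
    camera_item("cam3"),  # not attached to any site
    camera_item("cam4", "Site1234"),
]
print(enabled_cameras(cameras))             # all enabled cameras
print(enabled_cameras(cameras, "Site123"))  # enabled cameras at Site123 only
```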
We have a table like this:
user_id | video_id | timestamp
1 | 2 | 3
1 | 3 | 4
1 | 3 | 5
2 | 1 | 1
And we need to query the latest timestamp for each video viewed by a specific user.
Currently it's done like this:
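from boto3.dynamodb.conditions import Key

# Fetch every history item for the user from the index, newest first
response = self.history_table.query(
    IndexName='WatchHistoryByTimestamp',
    KeyConditionExpression=Key('user_id').eq(int(user_id)),
    ScanIndexForward=False,
)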
It queries all timestamps for all videos of the specified user, but that puts a huge load on the database, because there can be thousands of timestamps for thousands of videos.
I tried to find a solution on the Internet, but as far as I can see, all the SQL solutions use GROUP BY, and DynamoDB has no such feature.
There are 2 ways I know of doing this:
Method 1 GSI Global Secondary Index
GROUP BY is sort of like a partition in DynamoDB (but not really). I assume your partition key is currently user_id, but for this you want video_id as the partition key and timestamp as the sort key. You can get that by creating a new GSI with video_id as its partition key and timestamp as its sort key. This lets you query the latest timestamp for a given video. Read in descending order and limited to one item (with the AWS CLI, just add --max-items 1 --page-size 1), the query uses only about 1 RCU and is very fast. But you will need to supply the video_id.
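For illustration, the query described in Method 1 could be expressed with the low-level boto3 client roughly as follows. The table name WatchHistory and index name VideoByTimestamp are hypothetical, and the sketch assumes video_id is stored as a number. The function only builds the request parameters, so it can be inspected without touching AWS.

```python
def latest_view_query(video_id, table="WatchHistory", index="VideoByTimestamp"):
    """Build Query parameters that return only the newest item for one video."""
    return {
        "TableName": table,   # hypothetical table name
        "IndexName": index,   # hypothetical GSI: partition key video_id, sort key timestamp
        "KeyConditionExpression": "video_id = :v",
        "ExpressionAttributeValues": {":v": {"N": str(video_id)}},
        "ScanIndexForward": False,  # descending sort key order, newest first
        "Limit": 1,                 # stop after the first (latest) item
    }


params = latest_view_query(3)
print(params)

# With credentials configured, the call would look like:
#   import boto3
#   response = boto3.client("dynamodb").query(**params)
```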
Method 2 Sparse Index
The problem with Method 1 is that you need to supply an ID, whereas you might just want a list of videos with their latest timestamps. There are a couple of ways to do this. One way I like is a Sparse Index: if you have an attribute called latest and set it to true only on the item with the latest timestamp, you can create a GSI keyed on that latest attribute. Note that you will have to set and unset this value yourself, either in a Lambda function triggered by DynamoDB Streams or in your app.
That does seem weird, but this is how NoSQL works as opposed to SQL. I am battling with this myself on a current project, where I have to use some of these techniques. Each time I do it, it just doesn't feel right, but hopefully we'll get used to it.
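The set/unset bookkeeping for the sparse index can be sketched in plain Python, with a list standing in for the table. The function names are hypothetical, and the sketch assumes the index can be read per user (for instance a GSI with user_id as partition key and latest as sort key); only items that carry the latest attribute would appear in such an index.

```python
def record_view(items, user_id, video_id, timestamp):
    """Store a view and move the 'latest' flag to the newest item for (user, video)."""
    new_item = {"user_id": user_id, "video_id": video_id, "timestamp": timestamp}
    current = next(
        (i for i in items
         if i["user_id"] == user_id and i["video_id"] == video_id and "latest" in i),
        None,
    )
    if current is None or timestamp >= current["timestamp"]:
        if current is not None:
            del current["latest"]    # unset: the old item drops out of the sparse index
        new_item["latest"] = "true"  # set: only flagged items are indexed
    items.append(new_item)


def latest_views(items, user_id):
    """In-memory stand-in for querying the sparse index for one user."""
    return {i["video_id"]: i["timestamp"]
            for i in items if i["user_id"] == user_id and "latest" in i}


history = []
for user_id, video_id, timestamp in [(1, 2, 3), (1, 3, 4), (1, 3, 5), (2, 1, 1)]:
    record_view(history, user_id, video_id, timestamp)

print(latest_views(history, 1))  # {2: 3, 3: 5}
```

In a real table the unset and set would be two writes, so doing them in one transaction (or in a stream handler) avoids briefly having zero or two flagged items.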
I'm investigating whether to use AWS DynamoDB, Azure DocumentDB or Google Cloud for price and simplicity for my app, and am wondering what the best approach is for a typical invite schema.
An invite has
userId : key (who created the invite)
gameId : key
invitationList : collection of userIds
The queries I would be running are
Get invites where userId == me
Get invites where my userId is in the invitationList
In Mongo, I would just set an index on the embedded invitationList, and in SQL I would set up a join table of gameId and invited UserIds.
Using DynamoDB or DocumentDB, could I do this in one "table", or would I have to set up a second, denormalized table that has one invited userId per row with a set of invited gameIds?
e.g.
A secondary table with
InvitedUserId : key
GameIds : Collection
Similar to hslriksen's answer, if certain criteria are met, I recommend that you denormalize all of this into a single document. Those criteria are:
The invitationList for games cannot grow unbounded.
Even if it's bounded, a maximum-length array must still fit within the document and transaction limits.
However, different from hslriksen, I recommend that an example document look like this:
{
    gameId: <some game key>,
    userId: <some user id>,
    invitationList: [<user id 1>, <user id 2>, ...]
}
You might also decide to use the built-in id field for games in which case the name above is wrong.
The key difference between what I propose and hslriksen's answer is that the invitationList is a pure array of foreign keys. This will allow indexes to be used for an ARRAY_CONTAINS clause in your query.
Note, in DocumentDB, you would tend to store all entity types in the same big bucket and just distinguish them with a string type field or slightly better, an is_my_type boolean field.
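As a sketch of the two access patterns against documents shaped like the example above: the query strings use the DocumentDB SQL dialect with a @me parameter, and the Python functions are hypothetical in-memory equivalents, included only to show what each query is expected to return.

```python
# Query text for the two access patterns (parameter @me = the current user's id).
INVITES_I_CREATED = "SELECT * FROM c WHERE c.userId = @me"
INVITES_I_RECEIVED = "SELECT * FROM c WHERE ARRAY_CONTAINS(c.invitationList, @me)"


def invites_created_by(docs, me):
    """In-memory equivalent of INVITES_I_CREATED; returns the matching gameIds."""
    return [d["gameId"] for d in docs if d["userId"] == me]


def invites_received_by(docs, me):
    """In-memory equivalent of INVITES_I_RECEIVED; returns the matching gameIds."""
    return [d["gameId"] for d in docs if me in d["invitationList"]]


docs = [
    {"gameId": "g1", "userId": "alice", "invitationList": ["bob", "carol"]},
    {"gameId": "g2", "userId": "bob", "invitationList": ["alice"]},
    {"gameId": "g3", "userId": "alice", "invitationList": ["carol"]},
]
print(invites_created_by(docs, "alice"))   # ['g1', 'g3']
print(invites_received_by(docs, "carol"))  # ['g1', 'g3']
```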
For DocumentDB you could probably just keep this in one document per inviting user, where the document Id could equal the key of the inviting user. If you have many games, you could use gameId as partitionKey.
{
    "id" : "gameKey+invitingUserKey",
    "gameKey" : "someGameKey",
    "invitingUserId": "key",
    "invites": ["inviteKey1", "inviteKey2"]
}
This is based on a limited number of invites for a user/gameKey. It is however hard to determine the structure without knowing your query patterns. I find that the query patterns often dictate the document structure.
As per my data model, I need to store many to many relationship data items in dynamodb
Example :
I have a field called studentId and every studentId will have several subjects assigned to him.
Requirement :
For a given studentId, I need to store all the subjects, and I need to get all the subjects assigned to a given student.
Similarly, for a given subjectId, I need to know the studentIds to whom that subject has been assigned.
I am planning to store this in DynamoDB as follows:
Table1 : StudentToSubjects :
Hash Key : studentId,
RangeKey: subjectId
so that if I query using only the hash key, it would give me all the rows having that hash key and all the different range keys.
Secondary Key as
secondary HashKey: subjectId
Secondary RangeKey: studentId
I wanted to know if this makes sense and is the right thing to do, or whether there are better ways to solve this problem.
Your design looks OK, but you need to think it through before finalizing it. Say you have implemented this design: 10 years from now, when you query the table for a particular subject (using the secondary index, a GSI), you will get all the students of the past 10 years, which you might not need.
I would probably go with following
Student Master:
Hash Key: studentId
subjectIds (Number-set or String-set)
Subject Master:
Hash Key: subjectId
Range Key: Year
studentIds (Number-set or String-set)
The advantage of this is that you will consume fewer queries: for a particular subject or student you will consume only 1 read (if the item size is less than 4 KB).
Again, this is just scratching the surface; think of all the queries before finalizing the schema.
Edit: You don't need to repeat the studentId; it will remain unique.
It would look something like this:
studentId -> s1
subjectIds -> sub1,sub2,subN (this is a set)
studentId -> s2
subjectIds -> sub3,sub4
For the data types, you can refer to the following link: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DataModel.html#DataModel.DataTypes
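A minimal in-memory sketch of the two "master" tables, using Python sets where the tables would use string-sets. The names are hypothetical. In DynamoDB each assignment would be two UpdateItem calls using the ADD action on a set attribute, one per table; that mapping is stated here as an assumption about how the sketch would be ported.

```python
from collections import defaultdict

student_master = defaultdict(set)  # studentId -> set of subjectIds
subject_master = defaultdict(set)  # (subjectId, year) -> set of studentIds


def assign(student_id, subject_id, year):
    """Record one assignment on both sides of the many-to-many relationship."""
    student_master[student_id].add(subject_id)
    subject_master[(subject_id, year)].add(student_id)


def subjects_of(student_id):
    """One read on Student Master: all subjects of a student."""
    return sorted(student_master.get(student_id, set()))


def students_of(subject_id, year):
    """One read on Subject Master: students of a subject in a given year."""
    return sorted(subject_master.get((subject_id, year), set()))


assign("s1", "sub1", 2023)
assign("s1", "sub2", 2024)
assign("s2", "sub2", 2024)
assign("s1", "sub2", 2024)  # repeating an assignment leaves the sets unchanged

print(subjects_of("s1"))          # ['sub1', 'sub2']
print(students_of("sub2", 2024))  # ['s1', 's2']
```

Because both sides are written separately, the two tables can drift apart if one write fails, so the pair of updates is worth wrapping in a transaction or a retry.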