DynamoDB: Avoid duplicating data in many-to-many relationship - amazon-dynamodb

I am practicing DynamoDB, applying STD (single table design), and I am having some trouble designing my schema.
Given the simple many-to-many relationship "A Club can have N players while a player can play in M clubs".
And supposing the following two AP (access patterns):
Get all the players from a Club.
Get all clubs where a player plays.
The AP 1 can be solved by the following schema:
Pseudo-Query:
GET * WHERE PK = 'CLUB#C1'
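The pseudo-query above can be sketched with an in-memory stand-in for the table (the item layout and attribute names here are illustrative, not from the original post):

```python
# Hypothetical single-table layout: the club's own item and one item per
# player membership all share the club's partition key.
table = [
    {"PK": "CLUB#C1", "SK": "CLUB#C1",   "name": "Club One"},
    {"PK": "CLUB#C1", "SK": "PLAYER#P1", "player_name": "Alice"},
    {"PK": "CLUB#C1", "SK": "PLAYER#P2", "player_name": "Bob"},
    {"PK": "CLUB#C2", "SK": "PLAYER#P1", "player_name": "Alice"},
]

def query(pk):
    """Simulates GET * WHERE PK = pk (a DynamoDB Query on the partition key)."""
    return [item for item in table if item["PK"] == pk]

# AP 1: all players of club C1 (filter out the club's own metadata item).
players_of_c1 = [i for i in query("CLUB#C1") if i["SK"].startswith("PLAYER#")]
```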
However, I cannot figure out a good way to solve AP 2. I could add a GSI and use the following new schema:
But although I can get the club IDs, I am not retrieving the clubs' information.
I have read AWS's documentation on the adjacency list design pattern.
But, as far as I understand, it is quite different from my example, since it does not query the Invoice's specific information for each Bill.
In my case, I do need the Club's specific information shared by all players.
The only way I have figured out to do this (in a single query) is to store both entities' information for each Club-Player relation.
Considering that both Club and Player have non-static data that may change over time, how would I handle updates? Am I expected to perform N updates each time an attribute changes?

You can't always expect to get all of the data with a single lookup. You want all the clubs that a player plays for, which you can get from your GSI. Then you have two choices: wait until a user clicks on a club and fetch that information with a GetItem, or load all the needed clubs beforehand with a BatchGetItem.
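That two-step read can be sketched in memory (the GSI shape and key formats here are assumptions for illustration):

```python
# Main table: club items keyed by (PK, SK), both set to the club id.
main_table = {
    ("CLUB#C1", "CLUB#C1"): {"name": "Club One", "city": "Madrid"},
    ("CLUB#C2", "CLUB#C2"): {"name": "Club Two", "city": "Lisbon"},
}

# Hypothetical GSI that inverts the membership keys: partition key is the
# player, sort key is the club.
gsi = [
    {"GSIPK": "PLAYER#P1", "GSISK": "CLUB#C1"},
    {"GSIPK": "PLAYER#P1", "GSISK": "CLUB#C2"},
]

def clubs_for_player(player_pk):
    # Step 1: Query the GSI for the player's club ids.
    club_ids = [row["GSISK"] for row in gsi if row["GSIPK"] == player_pk]
    # Step 2: BatchGetItem on the main table to hydrate the full club records.
    return [main_table[(cid, cid)] for cid in club_ids]
```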

Related

DynamoDB modelling question: table with games between player1 and player2, how to get all games involving a given player

I am storing games in a database. Games are between two players: call them player1 and player2. I have a document per game, with keys 'player1' and 'player2' containing the player ids. Obviously, a given player could appear in either the player1 or player2 key depending on the draw.
Is there a way to structure my data so that I can efficiently find all games for a given player? I know that a query where player1=playerId OR player2=playerId is not possible in Dynamo. I am looking for ideas on how to manage it. I started by creating "linked" documents with playerId as the partition key and the date/time of the game as the sort key. But this is getting messy!
Maybe my best option is to create two GSIs (on player1 and player2) and do an application level union.
Thanks
If I'm reading your question correctly, your access patterns are
Fetch games for a player
Fetch most recent game for a player (based on comments on an earlier post)
Let's start by modeling the relationship between two players playing a game. I'll call it a Match (naming is hard). You could store the Matches between players like this:
I've made up a few attributes on the Match item to illustrate the concept. I'm using a simple primary key with the format MATCH#[KSUID]. If you're not familiar, a KSUID is a unique identifier that is sortable and has a built-in time component. You can use them like UUIDs, but get the useful side effect of sortability based on time. This feature of KSUIDs will be useful when retrieving the latest Match.
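The useful property here is that the ID sorts by creation time. A minimal sketch of a KSUID-like identifier (real KSUIDs use base62 encoding and a custom epoch; this simplified version just zero-pads a Unix timestamp in front of random hex):

```python
import secrets
import time

def ksuid_like(ts=None):
    """A KSUID-style ID sketch: a zero-padded timestamp prefix followed by
    random bytes, so lexicographic order matches creation order."""
    ts = int(time.time()) if ts is None else ts
    return f"{ts:010d}{secrets.token_hex(8)}"

a = ksuid_like(1600000000)
b = ksuid_like(1600000001)
# Sorting the IDs as strings sorts them by creation time.
```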
We can create two secondary indexes to model the matches from either player's perspective. For example, I'll create a secondary index named Player1Index and give it a GSIPK of the player1 attribute; the GSISK could be the PK of your main table. Using the example above, your data would look like this:
Similarly, the Player2Index would look like this
Notice that the KSUID is part of the sort key in both indexes, which means fetching matches by a player will automatically sort the matches in the order they were created. This would allow for your second access pattern where you fetch the latest match for a given player.
EDIT: If the goal is to get all matches where a given player was player1 or player2, you could create a Match item collection that contains each player as a separate item within the collection. For example:
Then you could create an inverted secondary index, swapping the PK/SK patterns in the index. That would look like this:
In this model, the secondary index would contain all matches for a given player, regardless of their role in the match. You may prefer this solution since you could grab the data in a single query with a single index. Pagination would be easier than the first approach.
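The item-collection approach can be sketched in memory (key formats and attributes are illustrative):

```python
# One item per player inside each Match item collection.
table = [
    {"PK": "MATCH#001", "SK": "PLAYER#P1", "role": "player1", "score": 21},
    {"PK": "MATCH#001", "SK": "PLAYER#P2", "role": "player2", "score": 15},
    {"PK": "MATCH#002", "SK": "PLAYER#P1", "role": "player2", "score": 9},
]

# The inverted GSI swaps PK and SK, so one query on the index returns every
# match for a player, regardless of whether they were player1 or player2.
def matches_for(player_sk):
    return [item["PK"] for item in table if item["SK"] == player_sk]
```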
Whichever path you take, the goal is to pre-join the data you need so it can be fetched in a single query. Sure, you could use the former pattern and query two indexes, and merge results in your application. Making two queries (versus one) isn't the worst thing in the world, but is way less satisfying than fetching the data all at once!

Queryable unbounded number of items

I've been thinking a lot about possible strategies for querying an unbounded number of items.
For example, think of a forum - you could have any number of forum posts categorized by topic. You need to support at least 2 access patterns: post details view and list of posts by topic.
// legend
PK = partition key, SK = sort key
While it's easy to get a single post, you can't effectively query a list of posts without a scan.
PK = postId
Great for querying all the posts for a given topic, but they all land in the same partition (a "hot partition").
PK = topic and SK = postId#addedDateTime
Store items in buckets, e.g. a new bucket for each day. This would push a lot of logic to the application layer and add latency. E.g. if you need to get 10 posts, you'd have to query today's bucket and, if the bucket contains fewer than 10 items, query yesterday's bucket, and so on. Don't even get me started on pagination. That would probably be a nightmare if it crosses buckets.
PK = topic#date and SK = postId#addedDateTime
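The bucket-walking logic described above can be sketched in memory (bucket keys and the 30-day cutoff are assumptions for illustration):

```python
from datetime import date, timedelta

# In-memory stand-in for day buckets keyed by "topic#YYYY-MM-DD".
buckets = {
    "dynamodb#2020-09-17": ["post7", "post8"],
    "dynamodb#2020-09-16": ["post4", "post5", "post6"],
    "dynamodb#2020-09-15": ["post1", "post2", "post3"],
}

def latest_posts(topic, want, today):
    """Walk backwards one bucket (day) at a time until enough posts are found,
    with a hypothetical 30-day cutoff so the loop always terminates."""
    results, day = [], today
    while len(results) < want and (today - day).days < 30:
        key = f"{topic}#{day.isoformat()}"
        results.extend(buckets.get(key, []))
        day -= timedelta(days=1)
    return results[:want]

posts = latest_posts("dynamodb", 5, date(2020, 9, 17))
```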
So my question is: how do I store and query an unbounded list of items the "DynamoDB way"?
I think you've got a good understanding about your options.
I can't profess to know the One True Way™ to solve this particular problem in DynamoDB, but I'll throw out a few thoughts for the sake of discussion.
While it's easy to get a single post, you can't effectively query a list of posts without a scan.
This would definitely be the case if your Primary Key consists solely of the postId (I'll use POST#<postId> to make it easier to read). That table would look something like this:
This would be super efficient for the "fetch post details view (aka fetch post by ID)" access pattern. However, we haven't built in any way to access a group of Posts by topic. Let's give that a shot next.
There are a few ways to model the one-to-many relationship between Posts and topics. The first thing that comes to mind is creating a secondary index on the topic field. Logically, that would look like this:
Now we can get an item collection of Posts by topic using the efficient query operation. Pagination will help you if your number of Posts per topic grows larger. This may be enough for your application. For the sake of this discussion, let's assume it creates a hot partition and consider what strategies we can introduce to reduce the problem.
One Option
You said
Store items in buckets, e.g new bucket for each day.
This is a great idea! Let's update our secondary index partition key to be <topic>#<truncated_timestamp> so we can group posts by topic for a given time frame (day/week/month/etc).
I've done a few things here:
Introduced two new attributes to represent the secondary index PK and SK (GSIPK and GSISK respectively).
Introduced a truncated timestamp into the partition key to represent a given month. For example, POST#1 and POST#2 both have a posted_at timestamp in September. I truncated both of those timestamps to 2020-09-01 to represent the entire month of September (or whatever time boundary that makes sense for your application).
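The truncation step can be sketched as follows (the topic name and key format are illustrative):

```python
from datetime import datetime

def truncate_to_month(iso_ts):
    """Truncate a timestamp to the first of its month, for use in a
    partition key such as <topic>#<truncated_timestamp>."""
    dt = datetime.fromisoformat(iso_ts)
    return dt.replace(day=1).date().isoformat()

# Every post from September 2020 shares the same partition key.
pk = f"dynamodb#{truncate_to_month('2020-09-17T14:32:00')}"
```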
This will help distribute your data across partitions, reducing the hot key issue. As you correctly note, this will increase the complexity of your application logic and increase latency, since you may need to make multiple requests to retrieve enough results for your application's needs. However, this might be a reasonable trade-off in this situation. If the increased latency is a problem, you could pre-populate a partition to contain the results of the prior N months' worth of a topic discussion (e.g. PK = TOPIC_CACHE#<topic> with a list attribute that contains the postIds from the prior N months).
If the TOPIC_CACHE ends up being a hot partition, you could always shard the partition using a calculated suffix:
Your application could randomly select a TOPIC_CACHE between 1..N when retrieving the topic cache.
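A minimal sketch of that suffix selection (the shard count of 4 is a made-up example):

```python
import random

N_SHARDS = 4  # hypothetical shard count

def topic_cache_key(topic):
    """Pick a random shard suffix between 1..N to spread reads and writes
    for a hot topic cache across several partitions."""
    return f"TOPIC_CACHE#{topic}#{random.randint(1, N_SHARDS)}"

key = topic_cache_key("dynamodb")
```

To read the full cache you would query all N shards and merge; the random pick only balances load on individual requests.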
There are numerous ways to approach this access pattern, and these options represent only a few possibilities. If it were my application, I would start by creating a secondary index using the Post topic as the partition key. It's the easiest to implement and would give me an opportunity to see how my application access patterns performed in a production environment. If the hot key issue started to become a problem, I'd dive deeper into some sort of caching solution.

DynamoDB - Modeling bidirectional many-to-many relationship?

I'm having a hard time modeling a certain scenario without having to appeal to more than one request.
Think of a People table: any two People can be related to each other n times, and each relationship has a description.
Consider the following modelling:
As you can see, I have two People, and person_0001 is child of person_0002.
Now, in this case, if I want to get all relationships that person_0001 has, it's easy, I just query :
GET WHERE PK = "person_0001" AND SK.BEGINS_WITH("rel")
But, since it is bidirectional, how can I get the relationships person_0002 has?
I could use a GSI that inverts the keys, so with one request I can simply query both tables at once.
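That inverted-keys idea can be sketched in memory (the SK format here is an assumption; both query directions are shown side by side):

```python
# Each relationship item embeds the other person's id in its sort key.
table = [
    {"PK": "person_0001", "SK": "rel#person_0002", "description": "child of"},
]

def forward(person):
    """Main table query: PK = person AND SK BEGINS_WITH 'rel'."""
    return [i for i in table if i["PK"] == person and i["SK"].startswith("rel")]

def inverse(person):
    """Inverted GSI view: the index partition key is the original SK,
    so 'rel#<person>' finds every relationship pointing at that person."""
    return [i for i in table if i["SK"] == f"rel#{person}"]
```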
But the real problem comes when I need to update or delete: how can I delete or update all relationships person_0002 has with only one request? I can only read from GSIs.
It's a big difficulty I have in general: what do I do when I need to perform a delete/update/write based on a GSI?

DynamoDB larger than 400KB items

I am planning to create a merchant table, which will have store locations of the merchant. Most merchants are small businesses and they only have a few stores. However, there is the odd multi-chain/franchise who may have hundreds of locations.
What would be my solution if I want to include location attributes within the merchant table? If I have to split it into multiple tables, how do I achieve that?
Thank you!
EDIT: How about splitting the table? To cater for the majority, say up to 5 locations, I can place them inside the same table. Beyond 5, it will spill over to a normalised table, with an indicator on the main table saying there are more than 5 locations. Any thoughts on how to achieve that?
You have a couple of options depending on your access patterns:
Compress the data and store the binary object in DynamoDB.
Store basic details in DynamoDB along with a link to S3 for the larger things. There's no transactional support across DynamoDB and S3 so there's a chance your data could become inconsistent.
Rather than embed location attributes, you could normalise your tables and put that data in a separate table with the equivalent of a foreign key to your merchant table. But, you may then need two queries to retrieve data for each merchant, which would count towards your throughput costs.
Catering for a spill-over table would have to be handled in the application code rather than at the database level: if (store_count > 5) then execute another query to retrieve more data
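That application-level check can be sketched like this (the cap of 5 and the two in-memory "tables" are illustrative):

```python
# Embed up to MAX_EMBEDDED locations on the merchant item; an indicator
# (store_count) tells the application when to make a second query.
MAX_EMBEDDED = 5

merchants = {
    "M1": {"store_count": 2, "locations": ["loc-a", "loc-b"]},
    "M2": {"store_count": 7,
           "locations": ["loc-a", "loc-b", "loc-c", "loc-d", "loc-e"]},
}
overflow = {"M2": ["loc-f", "loc-g"]}  # normalised spill-over table

def all_locations(merchant_id):
    m = merchants[merchant_id]
    locs = list(m["locations"])
    if m["store_count"] > MAX_EMBEDDED:        # indicator on the main item
        locs += overflow.get(merchant_id, [])  # second query for the rest
    return locs
```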
If you don't need the performance and scalability of DynamoDB, perhaps RDS is a better solution.
A bit late to the party, but I believe the right schema would be to have the partition key as merchantId and the sort key as storeId. This would create individual, separate records for each store, and you can store the geo location on each. This way:
You would not cross the 400KB threshold
Queries become efficient if you want to fetch the location of just one of the merchant's stores. If you want to fetch all the stores, there is no impact with this schema.
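A sketch of that layout with both access paths (key formats and coordinates are made up):

```python
# One small item per store, all sharing the merchant's partition key.
stores = [
    {"PK": "MERCHANT#M1", "SK": "STORE#S1", "lat": 40.4, "lon": -3.7},
    {"PK": "MERCHANT#M1", "SK": "STORE#S2", "lat": 41.4, "lon": 2.2},
    {"PK": "MERCHANT#M2", "SK": "STORE#S1", "lat": 38.7, "lon": -9.1},
]

def stores_of(merchant_pk):
    """Query: all stores for a merchant (one partition, many small items)."""
    return [s for s in stores if s["PK"] == merchant_pk]

def one_store(merchant_pk, store_sk):
    """GetItem: a single store, without reading the merchant's other stores."""
    return next(s for s in stores
                if s["PK"] == merchant_pk and s["SK"] == store_sk)
```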
PS: I am a Software Engineer working on Amazon DynamoDB.

Stuck on designing the schema for my firebase database

I come from a SQL background so I've been having a problem designing my NoSQL firebase schema. I'm used to being able to query for anything using the "WHERE" clause, and it seems more difficult to do so in firebase (although the performance EASILY makes up for it!).
I'm storing "track" objects for songs. These objects have key/value pairs such as artist_name, track title, genre, rating, created_date, etc. as below:
tracks
|_____-JPl1zwOzjqoM8xDTFll
|____ artist: "Bob"
|____ title: "so long"
|____ genre: "pop"
|____ rating: 52
|____ created: 1403129692781
|
|_____ -JPv7KnVi8ASQJjRDpvh
|____ artist: "Mary"
|____ title: "im alright now"
|____ genre: "rock"
|____ rating: 70
|____ created: 1403129692787
The default behaviour on my site will be to list all these tracks, with the newest added track appearing at the top of the list. I can set my $priority to be created and just turn it negative (created * -1) to achieve this effect I believe.
But in the future, I'd like to be able to filter/query the list by other means, for example:
Retrieve all tracks that have a genre of rock, pop, or hip-hop.
Retrieve all tracks that have a rating of 80 or higher, and have been added in the last 7 days.
How is it possible to achieve this in firebase? My understanding is that there are really only 2 ways to order data:
Through the "ID" value, which has the physical location of "firebaseURL.firebaseio.com/tracks/id", which in my case, was automatically selected for me when I add a track. This is okay (I think) as I have pages for individual track pages that list details, and the URL on my site is something like "www.mysite.com/tracks/-JPl1zwOzjqoM8xDTFll".
By using a $priority, which in my case, I've used on the "created" value so as to order my list in proper date order.
Given the way I have things set up (and please do let me know if there's a better way), is there a way I can easily query for specific genres, or specific ratings?
I read the blog "Denormalizing your Data is Normal" (https://www.firebase.com/blog/2013-04-12-denormalizing-is-normal.html), and I think I understand it. From what Anant describes, one way to achieve what I want would maybe be to create a new object in firebase for a genre and list all the tracks there, like so:
tracks
|______ All
|_____ -JPlB34tJfAJT0rFT0qI
|_____ -JPlB32222222222T0qI
|_____ -JPlB34wefwefFT0qI
|______ Rock
|_____ -JPlB32222222222T0qI
|_____ -JPlB34tJfAJT0rFT0qI
|______ Pop
|_____ -JPlB34wefwefFT0qI
The premise in the blog, was that hard drive space was cheap, but a user's time is not. Thus, it's okay for there to be duplicate data as it allows for faster reads.
That makes sense, and I wouldn't mind this method. But this would work only if a user wanted to select all tracks from only ONE genre. What if they wanted to get all the tracks from BOTH rock AND pop? Would I have to store another object called Rock&Pop and store a track in there each time someone submits a song of either genre?
genre
|_______pop-rock
|_________ -JPlB34tJfAJT0rFT0qI (a rock song)
|_________ -JPlB34wefwefFT0qI (a pop song)
|_________ -JPlB32222222222T0qI (a rock song)
Also, would it make more sense to store the ENTIRE track object or just a reference using the trackid? So for example, under /genre/pop:
Should I store just the reference?
genre
|______ pop
|______ -JPlB34wefwefFT0qI
Or, Should I store the entire track?
genre
|______ pop
|______ -JPlB34wefwefFT0qI
|___ artist: "bob"
|___ title: "hello"
|___ genre: pop
|___ etc..
Is there a performance difference between the two methods? I'm thinking that maybe the latter one would be faster, as I wouldn't need to query for each individual track for the other details but I just want to be sure.
I've redone my firebase schema several times already. I've made some improvements, but as my application is getting bigger, changing it gets more costly and consumes more time. It'd be nice if I could get these questions cleared up for the final time before I spend a lot of time redoing the rest of my code to match it again..
Thanks for any help with this, it's very much appreciated. And please let me know if you need additional information.
Firebase is rolling out a lot of additions to the query API over the next year. Contextual searching (where foo like bar) is probably never going to be a big hit in real-time data--it's slow and cumbersome.
There is a two-part blog article on sql queries and equivalent patterns in Firebase. I'd recommend you give it a read-through. Part 2, in particular, talks about Flashlight.
Why ElasticSearch and a service? Like real-time data storage and synchronization, search is a complex topic with a lot of boilerplate and discoverable complexity. It's easy to write a where clause in SQL and that will get you a ways, but it quickly falls short of user expectations.
ES can be integrated with Firebase in a snap (the Flashlight service took less than 5 minutes to integrate with an app, last time I attempted it), and provides robust and thorough search capabilities.
So until Firebase rolls out some game-changing features around querying, I'd suggest checking out this approach at the start, rather than trying to bolt on search capabilities by another means.
In your examples above, you build different hierarchies and store a little bit of the data, but just put the IDs in as keys. So when you get that onto the client, you'll probably still end up sorting by some of the data fields.
I like to let Firebase handle the sorting for me by using multi-part keys.
For instance, if I needed to access tracks by genre and artist name, I'd make a flat index node called tracksByGenreAndArtist, with a key composed of genre_name + artist_name + track_name + track_id. The value would be an object with artist name, artist id, track name, and track id. Adding the id is just to ensure it will be unique.
Now all the data is accessible in order of genre, artist, and track name. You could even do a predictive search against it, it's so fast.
Assume the user has selected the genre "Rock", and she types a 'B' into the search box. You could populate the predictions dropdown by grabbing the first ten tracks by artists whose name starts with 'B':
indexRef.orderByKey().startAt('Rock'+'B').limitToFirst(10);
Use the partial data object you've stored at that location to show the name of the artist and the track in the dropdown.
If the user selects a prediction, then use the track id to retrieve the full track object from your tracks node and artist id to retrieve the full artist object from the artists node.
If the user types a different letter, then just toss your predictions and do another predictive fetch, e.g.,
indexRef.orderByKey().startAt('Rock'+'Br').limitToFirst(10);
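The same sorted-key prefix scan can be simulated outside Firebase to see why it is fast; here is a minimal Python sketch (the "_" separator and the sample keys are illustrative, the original answer simply concatenates the parts):

```python
from bisect import bisect_left

# Flat index keyed by genre + artist + track (+ id for uniqueness),
# kept sorted so a binary search mimics orderByKey().startAt(...).limitToFirst(n).
index = sorted([
    "Rock_Beatles_Help_001",
    "Rock_Bowie_Heroes_002",
    "Rock_Clash_London_003",
    "Pop_Britney_Toxic_004",
])

def start_at(prefix, limit):
    """Jump to the first key >= prefix, then take up to `limit` matches."""
    i = bisect_left(index, prefix)
    return [k for k in index[i:i + limit] if k.startswith(prefix)]

rock_b = start_at("Rock_B", 10)
```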
Also, to your question of what to do if you need to search both Rock and Pop genres: well, you can do two queries like the ones above pretty quickly.
indexRef.orderByKey().startAt('Rock'+'Br').limitToFirst(10);
indexRef.orderByKey().startAt('Pop'+'Br').limitToFirst(10);
You could group them separately in your predictive dropdown: the first ten from Rock, followed by the first ten from Pop. If that isn't performant enough for you, you could always make a lot of combinatorial indexes with the same tiny data objects and every unique combination of genres that could be selected as a search filter, I suppose. Still, "disk is cheap but the user's time isn't" is your guiding maxim here.