DynamoDB modelling question: table with games between player1 and player2, how to get all games involving a given player - amazon-dynamodb

I am storing games in a database. Games are between two players: call them player1 and player2. I have a document per game, with keys 'player1' and 'player2' containing the player ids. Obviously a given player could appear in either the player1 or player2 key depending on the draw.
Is there a way to structure my data so that I can efficiently find all games for a given player? I know that a query where player1=playerId OR player2=playerId is not possible in DynamoDB. I am looking for ideas on how to manage it. I started by creating "linked" documents with playerId as the partition key and date/time of game as the sort key. But this is getting messy!
Maybe my best option is to create two GSIs (on player1 and player2) and do an application level union.
Thanks

If I'm reading your question correctly, your access patterns are
Fetch games for a player
Fetch most recent game for a player (based on comments on an earlier post)
Let's start by modeling the relationship between two players playing a game. I'll call it a Match (naming is hard). You could store the Matches between players like this:
I've made up a few attributes on the Match item to illustrate the concept. I'm using a simple primary key with the format MATCH#[KSUID]. If you're not familiar, a KSUID is a unique identifier that is sortable and has a built-in time component. You can use them like UUIDs, but get the useful side effect of sortability based on time. This feature of KSUIDs will be useful when retrieving the latest Match.
We can create two secondary indexes to model the matches from either player's perspective. For example, I'll create a secondary index named Player1Index and give it a GSIPK of the player1 attribute, and the GSISK could be the PK of your main table. Using the example above, your data would look like this
Similarly, the Player2Index would look like this
Notice that the KSUID is part of the sort key in both indexes, which means fetching matches by a player will automatically sort the matches in the order they were created. This would allow for your second access pattern where you fetch the latest match for a given player.
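As a rough sketch of the two-index approach, the request parameters for the low-level Query API might look like the following. The table name Games and the helper names are assumptions made up for illustration; the index names and key attributes follow the description above (Player1Index keyed on player1, Player2Index keyed on player2, both sorted by the table's PK).

```python
def matches_query(player_id, role, latest_only=False):
    """Build Query parameters for one of the two player indexes.

    role is "player1" or "player2". "Games" is a hypothetical table name.
    """
    if role not in ("player1", "player2"):
        raise ValueError("role must be 'player1' or 'player2'")
    params = {
        "TableName": "Games",
        "IndexName": "Player1Index" if role == "player1" else "Player2Index",
        "KeyConditionExpression": "#p = :player",
        "ExpressionAttributeNames": {"#p": role},
        "ExpressionAttributeValues": {":player": {"S": player_id}},
        # Newest first: the KSUID in the sort key orders matches by creation time.
        "ScanIndexForward": False,
    }
    if latest_only:
        params["Limit"] = 1
    return params


def merge_newest_first(as_player1, as_player2):
    """Union the items returned by both index queries, newest match first."""
    return sorted(
        as_player1 + as_player2,
        key=lambda item: item["PK"]["S"],
        reverse=True,
    )

# With boto3 this could be sent as, for example:
#   client = boto3.client("dynamodb")
#   items = client.query(**matches_query("alice", "player1"))["Items"]
```

To get the single latest match for a player with this model, you would ask each index for its latest item (latest_only=True) and keep the newer of the two.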
EDIT: If the goal is to get all matches where a given player was player1 or player2, you could create a Match item collection that contains each player as a separate item within the collection. For example:
Then you could create an inverted secondary index, swapping the PK/SK patterns in the index. That would look like this:
In this model, the secondary index would contain all matches for a given player, regardless of their role in the match. You may prefer this solution since you could grab the data in a single query with a single index. Pagination would be easier than the first approach.
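Here is a small in-memory sketch of that second model, just to show the shape of the data. The PLAYER# prefix, the helper names and the match ids are placeholders I made up (real ids would be KSUIDs), and the dictionary only simulates what the inverted index would hold.

```python
from collections import defaultdict


def match_items(match_id, player_ids):
    """One item per player inside the match's item collection."""
    return [
        {"PK": f"MATCH#{match_id}", "SK": f"PLAYER#{pid}"}
        for pid in player_ids
    ]


def inverted_index(items):
    """Simulate the inverted GSI: index partition key = SK, index sort key = PK."""
    index = defaultdict(list)
    for item in items:
        index[item["SK"]].append(item["PK"])
    for pks in index.values():
        pks.sort()  # a real index keeps each partition ordered by its sort key
    return index


table = match_items("0001", ["alice", "bob"]) + match_items("0002", ["carol", "alice"])
index = inverted_index(table)
# Every match alice played, whichever side she was on.
print(index["PLAYER#alice"])
```

A single Query against the index partition PLAYER#alice then plays the role of the lookup on the last line.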
Whichever path you take, the goal is to pre-join the data you need so it can be fetched in a single query. Sure, you could use the former pattern and query two indexes, and merge results in your application. Making two queries (versus one) isn't the worst thing in the world, but is way less satisfying than fetching the data all at once!

Related

AWS DynamoDB Naming Convention

I am trying to create a naming convention for different objects in DynamoDB, such as tables, partition and sort keys, LSIs, GSIs, attributes, etc. I read a lot of articles and there is no common way to do that but want to learn from real-time examples to choose which one will fit best our needs.
The infrastructure I am working on is based on microservices. Along with this, some of our development environments share the same AWS account. Based on this, I ended up with something like this:
Tables: [Environment].[Service Name].[Table Name].ddb-table
GSIs/LSIs: [Environment].[Service Name].[Table Name].[GSI/LSI Name].ddb-[gsi/lsi]
Partition Key: pk ??? (in my understanding, the keys should have abstract names, because the single table stores versatile data in the same key)
Sort Key: sk ??? (in my understanding, the keys should have abstract names, because the single table stores versatile data in the same key)
Attributes: meaningful but as short as possible as they are kept for every item in the table
Different elements are separated by dot (.)
All names are separated by dashes (kebab-case) and in lower case
Tables/GSIs/LSIs are in singular form
Here is an example:
Table: dev.user-service.user-order.ddb-table
LSI: dev.user-service.user-order.lsi1pk.ddb-lsi
GSI: dev.user-service.user-order.gsi1pk.ddb-gsi
What naming conventions do you follow?
Thanks a lot in advance!
My advice:
Use PK and SK as your partition key and sort key.
Don't put table names into code. Use ParameterStore. For example, if you ever do a table restore it will be to a new table name, and if you want to send traffic to the new name you'll not want to change code.
Thus don't get too fixed to any particular table name. Never try to have code predict a table name. Only have them be consistent to help humans.
Don't put regions in your table names. When you switch to Global Tables they all keep the same name. Awkward!
GSIs can be called GSI1, GSI2, etc. GSI keys are GSI1PK and GSI1SK, etc.
Tag your tables with their name if you ever want to track per-table costs later.
Short yet meaningful attribute names are nice because they reduce storage and can reduce RCU/WCU consumption if your items are near the 4 KB (read) or 1 KB (write) boundaries.
Use different accounts for dev, staging, and production. If you want to put the names into tables as well to help you spot "OMG I'm in production" that's fine.
If you have lots of attributes as the item payload which aren't used for GSIs or filtering and are always returned together, consider just storing them as a string or binary which gets parsed client side. You can even compress them. It's more efficient and lower latency because it skips the data marshaling.
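To make the size-related points concrete, here is a minimal sketch that packs a payload into one compressed binary value and shows how item size maps onto the 1 KB write and 4 KB read steps. The function names and the sample payload are made up for illustration, and the byte counts passed to the helpers stand for the whole item size (attribute names included), which this sketch does not compute.

```python
import json
import math
import zlib


def pack_payload(attributes):
    """Serialize and compress attributes into bytes for a single Binary attribute."""
    raw = json.dumps(attributes, separators=(",", ":")).encode("utf-8")
    return zlib.compress(raw)


def unpack_payload(blob):
    """Reverse of pack_payload, done client side after reading the item."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


def write_units(item_bytes):
    """Standard writes are billed in 1 KB steps, rounded up."""
    return max(1, math.ceil(item_bytes / 1024))


def read_units(item_bytes):
    """Strongly consistent reads are billed in 4 KB steps, rounded up."""
    return max(1, math.ceil(item_bytes / 4096))


payload = {"description": "lorem ipsum " * 200, "tags": ["a", "b", "c"]}
blob = pack_payload(payload)
assert unpack_payload(blob) == payload
print(len(json.dumps(payload)), "bytes raw ->", len(blob), "bytes compressed")
```

How much you save depends entirely on how compressible your data is, so measure with real items before committing to this.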

Should I make this field a GSI, a regular attribute, or something else in order to have efficient queries?

For my DynamoDB table, I currently have a schema like this:
Partition key - Unique ID, so every item has a completely unique ID
Sort key - none
Attribute - JSON that contains some values
Now, I want to add a new field that will be required for every item and will indicate the specific region (e.g. NA-1, NA-2, JP-1, and so on) and I want to be able to do queries on just this field. For example, I might want to perform a query on my table to retrieve all items with the region NA-1.
My question is should I make this field a GSI? I'm new to DynamoDB so I've been researching online and it seems that using a GSI is preferred when that field may only be present for select items in the table, but my field will be required for every item, so I think using a GSI is not an option.
The other possible option I've seen is performing a scan operation and using a filter expression, but from what I've seen, that's a costly operation because DynamoDB has to look at the entire table part-by-part and then filter afterwards. My table isn't very big right now, but it may become quite large in the future, so I would like a scalable option.
TL;DR Is there some way I can add a mandatory regionID field to my table and perform efficient queries on it? What are some good options I should look into?
Yeah, a GSI might not be the best fit here. Maybe you can somehow make it part of the partition key?
Yes. Perform 2 writes on the table. The first row will be what you are currently writing, and the second row will have your region as the partition key. Do not forget to use transactions, as it is possible that one of the writes does not succeed.
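A minimal sketch of what such a transaction could look like with the low-level TransactWriteItems parameters. Everything named here is an assumption: the table names Items and ItemsByRegion, the attribute names, and in particular the choice to put the region rows somewhere that has a sort key (regionId as partition key, id as sort key), since many items share one region and would otherwise overwrite each other.

```python
def two_row_write(item_id, region_id, payload_json):
    """Build TransactWriteItems parameters so both rows succeed or fail together."""
    return {
        "TransactItems": [
            {
                # The row you are already writing, keyed by the unique ID.
                "Put": {
                    "TableName": "Items",  # hypothetical table name
                    "Item": {
                        "id": {"S": item_id},
                        "regionId": {"S": region_id},
                        "payload": {"S": payload_json},
                    },
                }
            },
            {
                # The lookup row, partitioned by region so it can be queried.
                "Put": {
                    "TableName": "ItemsByRegion",  # hypothetical table name
                    "Item": {
                        "regionId": {"S": region_id},
                        "id": {"S": item_id},
                    },
                }
            },
        ]
    }

# With boto3 this could be sent as:
#   boto3.client("dynamodb").transact_write_items(**two_row_write("abc", "NA-1", "{}"))
```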
While you can use a GSI, you have to realize that it is eventually consistent. It takes some time to update, and you might get stale data if you query it right after writing.
DynamoDB is a distributed data store, i.e. it does not keep the data on a single server but partitions it using the provided partition key (PK). This means your data is spread across multiple servers, which brings the limitation that a query can target only a single partition key value at a time.
Coming back to your query pattern,
retrieve all items with the region X
You need to add region-id as an attribute in the main table and make it part of the GSI. Do note that to avoid conflicts you need to make the GSI SK a composite SK.
I would recommend using <region>#<unique-id>
This way you can query the GSI like,
where begins_with(SK, 'X#'), combined with the equality condition on the GSI partition key that every Query requires
Also, if any entry moves to a new region or a new entry is created in a region, the change is automatically reflected in the GSI and in your query results
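As a hedged sketch of the GSI route, the Query parameters could look like this. The index name RegionIndex, the table name Items and the attribute name regionId are assumptions; I am also assuming the index uses regionId as its partition key, with the composite value from above as its sort key.

```python
def gsi_sort_key(region_id, item_id):
    """Composite sort key in the <region>#<unique-id> format suggested above."""
    return f"{region_id}#{item_id}"


def region_query(region_id, start_key=None):
    """Build Query parameters that fetch one page of items for a region."""
    params = {
        "TableName": "Items",        # hypothetical table name
        "IndexName": "RegionIndex",  # hypothetical index name
        "KeyConditionExpression": "#region = :region",
        "ExpressionAttributeNames": {"#region": "regionId"},
        "ExpressionAttributeValues": {":region": {"S": region_id}},
    }
    if start_key is not None:
        # Continue from the LastEvaluatedKey of the previous page.
        params["ExclusiveStartKey"] = start_key
    return params

# With boto3: boto3.client("dynamodb").query(**region_query("NA-1"))
```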

How to query on more than 2 attributes in DynamoDB using GSI?

I have a use case where I have to query on more than 2 attributes of a DynamoDB table. As far as I know, we can only query on up to 2 attributes (partition key, sort key) of a DDB table using a GSI. Is there anything that allows us to query on multiple attributes (say invoiceId, clientId, invoiceStatus) using a GSI?
Yes, this is possible, but you need to take into account every access pattern you want to support when you design your table.
This topic has been discussed at re:Invent multiple times. Here is a video from a few years ago https://youtu.be/HaEPXoXVf2k?t=2102 but similar talks have been given on the topic every year.
Two main options are using composite keys or query filters.
Composite keys are very powerful and boil down to making new 'synthetic' keys that simply concatenate other fields that you have in your record and then using these in your GSI.
For example, if you have a client where you want to be able to get all of their open invoices but also want to be able to get an individual invoice, you could use clientId as the partition key and concatenate invoiceStatus and invoiceId together as the sort key. You can then use begins_with to have only a certain invoice status returned. In this example, you'd have to know both the invoiceStatus and the invoiceId to fetch an individual invoice, making this not the best example.
The composite key pattern is also useful for dates as you can use greater than or less than to search certain time ranges. However, it is also possible just to directly get the records with the concatenation.
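A small in-memory sketch of the composite sort key idea, using the invoice example. The attribute name sk, the # separator and the sample data are my own choices for illustration; the list comprehension only stands in for a Query with clientId = :c AND begins_with(sk, :prefix).

```python
def invoice_sort_key(status, invoice_id):
    """Concatenate status and id into one synthetic sort key."""
    return f"{status}#{invoice_id}"


def invoices_by_status(items, client_id, status):
    """In-memory stand-in for: clientId = :c AND begins_with(sk, :prefix)."""
    prefix = f"{status}#"
    return sorted(
        (i for i in items if i["clientId"] == client_id and i["sk"].startswith(prefix)),
        key=lambda i: i["sk"],
    )


items = [
    {"clientId": "c1", "sk": invoice_sort_key("OPEN", "inv-002")},
    {"clientId": "c1", "sk": invoice_sort_key("PAID", "inv-001")},
    {"clientId": "c1", "sk": invoice_sort_key("OPEN", "inv-001")},
    {"clientId": "c2", "sk": invoice_sort_key("OPEN", "inv-009")},
]
print([i["sk"] for i in invoices_by_status(items, "c1", "OPEN")])
# ['OPEN#inv-001', 'OPEN#inv-002']
```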
An alternative design is using query filters. This is less efficient as DynamoDB will have to scan every record that matches the partition and sort key. However, the filter can be applied to any attribute and reduces the amount of data transmitted from DynamoDB to your application. This is useful when your main keys are mostly selective, but multiple matches are possible and the filter gets you the rest of the way there.
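For comparison, a sketch of Query parameters that use a filter. The table name Invoices and the attribute amount are assumptions; the filter may only reference non-key attributes, and read capacity is consumed for everything the key condition matches, before the filter is applied.

```python
def open_invoices_over(client_id, minimum_amount):
    """Build Query parameters: key condition on clientId, filter on the rest."""
    return {
        "TableName": "Invoices",  # hypothetical table name
        "KeyConditionExpression": "clientId = :client",
        # Applied after the items are read, before they are returned.
        "FilterExpression": "invoiceStatus = :status AND #amount >= :minimum",
        "ExpressionAttributeNames": {"#amount": "amount"},
        "ExpressionAttributeValues": {
            ":client": {"S": client_id},
            ":status": {"S": "OPEN"},
            ":minimum": {"N": str(minimum_amount)},
        },
    }

# With boto3: boto3.client("dynamodb").query(**open_invoices_over("c1", 100))
```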
The other aspect of using a GSI that can help reduce cost is projecting only the attributes you care about. When a record is updated the GSI only updates if one of the projected attributes is updated. By keeping the GSI skinny it makes the previously listed strategies more cost effective.
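A sketch of what a "skinny" index definition could look like as one entry of the GlobalSecondaryIndexes list passed to CreateTable. The helper and the names passed to it are hypothetical; on a provisioned-capacity table the entry would also need ProvisionedThroughput.

```python
def skinny_gsi(index_name, partition_key, sort_key, extra_attributes):
    """GSI definition that projects only the keys plus a few chosen attributes."""
    return {
        "IndexName": index_name,
        "KeySchema": [
            {"AttributeName": partition_key, "KeyType": "HASH"},
            {"AttributeName": sort_key, "KeyType": "RANGE"},
        ],
        "Projection": {
            # INCLUDE copies only the listed non-key attributes into the index.
            "ProjectionType": "INCLUDE",
            "NonKeyAttributes": list(extra_attributes),
        },
    }


entry = skinny_gsi("ClientStatusIndex", "clientId", "statusInvoiceId", ["amount"])
```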

Pagination with Filtering using Query Operation in DynamoDB Template

I would like to be able to filter a paginated result using a query operation before the limit is taken into consideration. Is there any suggestion for getting correct pagination on filtered results?
I would like to implement a DynamoDB Scan OR Query with the following logic:
Scanning -> Filtering(boolean true or false) -> Limiting(for pagination)
However, I have only been able to implement a Scan OR Query with this logic:
Scanning -> Limiting(for pagination) -> Filtering(boolean true or false)
Note: I have already tried a Global Secondary Index, but it didn't work in my case because I have 5 different attributes to filter and limit on.
Unfortunately DynamoDB is not capable of this. Once you run a Query on one of your indexes, it reads every single item that satisfies your partition and sort key conditions.
Let's check your example. You have a boolean and an index over that field. Let's say 50% of items are false and 50% are true. When you search by that index you read through 50% of all items in the table (so it's almost like a SCAN). If you set a limit, it reads only that number of items and then stops. You cannot combine limit with skip/page/offset like in other databases.
There is some level of pagination https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Query.Pagination.html but it does not let you jump to, e.g., page 10; it only lets you go through the pages one by one. Also, I am not sure how it is priced; maybe AWS internally goes through all the items before preparing the results for you, so you would pay for reading 50% of the whole table even if you stop iterating before you reach the end.
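The page-by-page behaviour can be sketched as a loop that keeps following LastEvaluatedKey until enough filtered items have been collected. The query_page callable is a hypothetical function you would write around your own Query (with its FilterExpression); this sketch does not trim the result, so a page can contain a few more items than page_size.

```python
def fetch_filtered_page(query_page, page_size, start_key=None):
    """Collect at least page_size filtered items, or everything that is left.

    query_page(start_key) is a caller-supplied (hypothetical) function that
    runs one Query and returns the raw response dict.
    Returns (items, next_key); next_key is None when nothing is left.
    """
    items = []
    key = start_key
    while True:
        response = query_page(key)
        # A filtered response can be empty even though more data follows.
        items.extend(response.get("Items", []))
        key = response.get("LastEvaluatedKey")
        if key is None or len(items) >= page_size:
            break
    return items, key
```

The caller hands next_key back as start_key to get the following page, which is why jumping straight to an arbitrary page is not possible.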
There is also the limitation that an index can have a maximum of 2 key fields (partition, sort).
EXAMPLE
You wrote that you have 5 parameters you want to query. The workaround used to address these limitations is to create and maintain extra fields that hold combinations of the parameters you want to query. Let's say you have a table of users with gender, age, name, surname and position. Let's say it's a huge database, so you have to think about the amount of data you can load. If you want to use DynamoDB, you then have to think about all the queries you want to run.
You most likely want to search by name and surname, so you create an index with surname as partition key and name as sort key (in that case you can search by surname alone or by both surname and name). It can work for a lot of names, but then you find that some name combinations are too common and you need to filter by position as well. In that case, you create a new field (column) called e.g. name-surname, and whenever you create or update an item your app has to maintain this field so that it contains both values, e.g. will-smith. Then you can make another index that has name-surname as partition key and position as sort key, and use it for such searches.
However, you then find that some name-surname-position combinations return too many results, you don't want to handle that at the application level, and you want to limit results by age as well. Then you can create an index with name-surname-position as partition key and age as sort key. At this point you may also realize that your old name-surname field and index can be removed, as they serve no purpose anymore (name and surname are handled by the first index, and for searching by just name-surname-position you can use this one).
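A minimal sketch of maintaining those synthetic fields in the application on every create or update. The helper name is made up, and lower-casing the values is my own normalization choice to match the will-smith example.

```python
def with_search_keys(user):
    """Return a copy of the user item with the synthetic index attributes filled in."""
    item = dict(user)
    parts = [user["name"], user["surname"]]
    item["name-surname"] = "-".join(p.strip().lower() for p in parts)
    item["name-surname-position"] = "-".join(
        p.strip().lower() for p in parts + [user["position"]]
    )
    return item


user = with_search_keys(
    {"name": "Will", "surname": "Smith", "position": "Actor", "age": 55}
)
print(user["name-surname"])           # will-smith
print(user["name-surname-position"])  # will-smith-actor
```

If names can themselves contain a hyphen, pick a separator that cannot occur in the values.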
Do you sometimes want to query by gender as well? It's probably better to handle that at the application level (or with an extra filter in the database query) rather than creating a new index that must be maintained and paid for. There are only a few gender values (say there are more than two, but 99% of people will have just male or female), so it's probably cheaper to load all of them and hide a few items at the application level if someone wants to see only one gender. With an extra index you would pay for every single insert, while this filter will be used only from time to time. Also, when someone already searches by name, surname and position you don't expect many results anyway, so whether you get 20 results (all genders) or just 10 (male only) does not make much difference.
This ^^ was just an example of how you can think about and work with DynamoDB. How exactly you use it depends on your business logic.
Very important note: DynamoDB is a very simple database that can only do very simple queries. It has a little more functionality than Redis but a lot less than traditional databases. A valid result of thinking through your business model and use cases is that maybe you should NOT use DynamoDB at all, because it simply cannot satisfy your needs and queries.
Some basic thinking can look like this:
Is key-value persistent storage enough? Use DynamoDB
Is key-value persistent storage, where one item can have multiple keys and I can search and filter by a maximum of 2 fields, enough? Use DynamoDB
Is persistent storage, where I want to search a single Table/Collection by many keys with a lot of options, enough? Use MongoDB
Do I need to search through multiple tables or do complex joins or need transactions? Use traditional SQL database

Filtering results with Geofire + Firebase

I'm trying to figure out how to query with filter with Geofire.
Suppose I have restaurants with different categories, and I want to add that category to my query. How do I go about this?
One way I have now is querying the keys with Geofire, running a for loop over each key to get the restaurant, and inserting the matching restaurants into an array.
This seems so inefficient. Is there any other way to go about this?
Ideally I will have the filtered results, and only load each item when they're about to be shown.
Cheers!
Firebase queries can only filter by one condition. Geofire already does quite some "magic" to allow it to filter on both longitude and latitude. Adding another property to that equation might be possible, but is well beyond what Geofire handles by default. See GeoFire: How to add extra conditions within the query?
If you only ever want to access one category at a time, you can put the restaurants in a top-level node per category and point Geofire to one category.
/category1
  item1
    g: "pns0h0mf2u"
    l: [-53.435719, 140.808716]
  item2
    g: "u417k3dwub"
    l: [56.83069, 1.94822]
/category2
  item3
    g: "8m3rz3s480"
    l: [30.902225, -166.66809]
/items
  item1: ...
  item2: ...
  item3: ...
In the above example, we have two categories: category1 with 2 items and category2 with just 1 item. For each item, we see the data that Geofire uses: a geohash and the longitude and latitude. We also keep a single list with the other properties of these 3 items.
But more commonly, you simply do the extra filtering in client-side code. If you're worried about the performance of that: measure it, share the code, JSON data and measurements.
This is an old question, but I've seen it in a few places on the web, so I thought I might share one trick I've used.
The Problem
If you have a large collection in your database, maybe containing hundreds of thousands of keys, for example, it might not be feasible to grab them all. If you're trying to filter results based on location in addition to other criteria, you're stuck with something like:
Execute the location query
Loop through each returned geofire key and grab the corresponding data in the database
Check each returned piece of data to see if it matches the other criteria
Unfortunately, that's a lot of network requests, which is quite slow.
More concretely, let's say we want to get all users within e.g. 100 miles of a particular location that are male and between ages 20 and 25. If there are 10,000 users within 100 miles, that means 10,000 network requests to grab the user data and compare their gender and age.
The Workaround:
You can store the data you need for your comparisons in the geofire key itself, separated by a delimiter. Then, you can just split the keys returned by the geofire query to get access to the data. You still have to filter through them, but it's much faster than sending hundreds or thousands of requests.
For instance, you could use the format:
UserID*gender*age, which might look something like facebook:1234567*male*24. The important points are
Separate data points by a delimiter
Use a valid character for the delimiter -- "It can include any unicode characters except for . $ # [ ] / and ASCII control characters 0-31 and 127."
Use a character that is not going to be found elsewhere in your database - I used *, but that might not work for you. Do not use any characters from -0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz, since those are fair-game for keys generated by firebase's push()
Choose a consistent order for the data - in this case, UserID first, then gender, then age.
You can store up to 768 bytes of data in firebase keys, which goes a long way.
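Here is a rough sketch of the encode/split/filter step. It is written in Python only to show the idea (GeoFire itself is used from other client SDKs), and the function names are made up; the keys list stands in for whatever your location query returns.

```python
DELIMITER = "*"


def encode_key(user_id, gender, age):
    """Build a key in the UserID*gender*age format."""
    fields = [str(user_id), str(gender), str(age)]
    if any(DELIMITER in field for field in fields):
        raise ValueError("fields must not contain the delimiter")
    return DELIMITER.join(fields)


def decode_key(key):
    """Split a key back into its parts, in the same fixed order."""
    user_id, gender, age = key.split(DELIMITER)
    return {"user_id": user_id, "gender": gender, "age": int(age)}


def filter_keys(keys, gender, min_age, max_age):
    """Filter the keys from a location query without fetching each record."""
    matches = []
    for key in keys:
        user = decode_key(key)
        if user["gender"] == gender and min_age <= user["age"] <= max_age:
            matches.append(user["user_id"])
    return matches


nearby = [
    "facebook:1234567*male*24",
    "facebook:7654321*female*22",
    "facebook:1111111*male*31",
]
print(filter_keys(nearby, "male", 20, 25))  # ['facebook:1234567']
```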
Hope this helps!

Resources