Quick question on modeling data for a customer …
Customer stores Store data, about 250 records, maybe 10 properties each.
Customer stores Department data, about 1,000 records, again, maybe 10 properties each.
Customer stores Product data, about 2,000,000 records, maybe 20 properties each.
My thought for modeling this data, based on how it is accessed, is to store the Store and Department data in a lookups collection, partitioned on the object property, in this case Store or Department.
Store the Product data in a products collection, partitioned on the upc_code property.
Does this make sense? Or is there a better way? Specifically for handling small (< 1,000 record) datasets, should I recommend Table Storage for any of this?
Thanks in advance!
Yes, that could work. I wouldn't use the Table API for this, though. If you want key/value features, use the SQL API and turn off indexing, but only if you only ever look up products by upc_code.
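For reference, here's a minimal sketch of what that could look like with the Python SDK; the account endpoint, key, database/container names, and the /object partition-key path are placeholders, not something from your setup:

```python
from azure.cosmos import CosmosClient, PartitionKey

# Placeholder endpoint/key and names -- adjust to your account and schema.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
db = client.create_database_if_not_exists(id="catalog")

# Products: partitioned on /upc_code, indexing disabled for pure
# key/value (point-read) access by upc_code.
products = db.create_container_if_not_exists(
    id="products",
    partition_key=PartitionKey(path="/upc_code"),
    indexing_policy={"indexingMode": "none", "automatic": False},
)

# Lookups: Store and Department items in one collection,
# partitioned on the object property.
lookups = db.create_container_if_not_exists(
    id="lookups",
    partition_key=PartitionKey(path="/object"),
)
```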
Another question. Is this data all related? Have you looked at possibly storing this as a graph and using the Gremlin API in Cosmos?
Related
So I'm currently designing three tables: organization, organization_relationships, and members.
Organization
OrgID PK
Metadata...
Org_Relationships
ParentOrgID PK
ChildOrgID Range/GSI
Member
OrgID PK
MemberID Range/GSI
One way I need to access the data is by determining whether two members share a parent organization. The way this is modeled right now, I would basically have to do an awkward search across the tables, requiring multiple calls, to determine whether two members belong to the same parent organization. With that said, is there a more efficient way of designing the tables so this doesn't require multiple calls?
The reason you're having to perform multiple queries is because you've modeled the relationship across several tables. This is a common approach when using traditional relational databases, but could be considered an anti-pattern with NoSQL databases.
Keep in mind that DynamoDB does not have a join operation like SQL databases. Therefore, it is a best practice to store related data in the same DynamoDB table. This can be counter-intuitive if you're used to working with relational DBs.
There are several ways to model your data in DynamoDB. The approach you choose depends on your access patterns. In other words, you store your data in a way that makes it easier to get the data your application needs.
For example, here's one way to model Users and Organizations:
The primary key is made up of a user id (e.g. USER#) and a sort key of META. This record (called an "item") in DynamoDB is where I'll define various user attributes. In this example, I've provided a name and an org attribute.
For illustrative purposes, I've also created a global secondary index (GSI) that swaps the partition key/sort key pattern in your base table. Your GSI will look like this:
This lets you fetch all users by organization.
If I wanted to check if two users are in the same organization, I can either query the GSI, or fetch both user records and compare the org fields.
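For example, a minimal sketch of both approaches with boto3; the table name, the PK/SK/org attribute names, and the "OrgIndex" GSI are assumptions for illustration, not your actual schema:

```python
import boto3
from boto3.dynamodb.conditions import Key

# Assumed names: table "AppTable", key attributes PK/SK, an "org"
# attribute on each user's META item, and a GSI keyed on org.
table = boto3.resource("dynamodb").Table("AppTable")

def same_org(user_a: str, user_b: str) -> bool:
    """Fetch both users' META items and compare their org attribute."""
    a = table.get_item(Key={"PK": f"USER#{user_a}", "SK": "META"}).get("Item")
    b = table.get_item(Key={"PK": f"USER#{user_b}", "SK": "META"}).get("Item")
    return bool(a and b and a.get("org") == b.get("org"))

def users_in_org(org: str) -> list:
    """Query a GSI keyed on the org attribute to list all users in an organization."""
    resp = table.query(
        IndexName="OrgIndex",  # hypothetical GSI name
        KeyConditionExpression=Key("org").eq(org),
    )
    return resp["Items"]
```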
This is just an example meant to give you a starting point with NoSQL design. The key takeaways here are:
NoSQL (or non-relational) data modeling is different than SQL (relational) data modeling.
You want to store related data in the same table.
How you store your data depends entirely on how you plan to use the data.
I have a DynamoDB structure as following.
I have patients with patient information stored in its documents.
I have claims with claim information stored in its documents.
I have payments with payment information stored in its documents.
Every claim belongs to a patient. A patient can have one or more claims.
Every payment belongs to a patient. A patient can have one or more payments.
I created only one DynamoDB table, since all of the AWS DynamoDB documentation indicates that using a single table, where possible, is the best approach. So I ended up with the following:
In this table, ID is the partition key and EntryType is the sort key. Every claim and payment item holds its owner's ID.
My access patterns are as following :
Listing all patients in the DB, with pagination, sorted by creation date.
Listing all claims in the DB, with pagination, sorted by creation date.
Listing all payments in the DB, with pagination, sorted by creation date.
Listing claims of a particular patient.
Listing payments of a particular patient.
I can achieve these with two global secondary indexes. I can list patients, claims and payments sorted by their creation date by using a GSI with EntryType as a partition key and CreationDate as a sort key. Also I can list a patient's claims and payments by using another GSI with EntryType partition key and OwnerID sort key.
My problem is that this approach only gives me sorting by creation date. My patients and claims have many more attributes (around 25 each) and I need to sort by each of those attributes as well. But Amazon DynamoDB limits every table to at most 20 GSIs. So I tried creating GSIs on the fly (dynamically, per request), but that also turned out to be very inefficient, since creating a GSI copies the items into another partition (as far as I know). So what is the best solution to sort patients by their patient name, claims by their claim description, and any other fields they have?
Sorting in DynamoDB happens only on the sort key. In your data model, your sort key is EntryType, which doesn't support any of the access patterns you've outlined.
You could create a secondary index on the fields you want to sort by (e.g. creationDate). However, that pattern can be limiting if you want to support sorting by many attributes.
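For instance, here's a rough sketch of querying such an index with boto3; the table name, index name, and page size are assumptions here, while the EntryType/CreationDate attributes come from your description:

```python
import boto3
from boto3.dynamodb.conditions import Key

# Assumed names: table "ClaimsData", GSI "EntryTypeByCreationDate"
# (partition key EntryType, sort key CreationDate).
table = boto3.resource("dynamodb").Table("ClaimsData")

resp = table.query(
    IndexName="EntryTypeByCreationDate",
    KeyConditionExpression=Key("EntryType").eq("Claim"),
    ScanIndexForward=False,  # newest first
    Limit=25,                # page size
)
page = resp["Items"]
# Pass LastEvaluatedKey back in as ExclusiveStartKey to fetch the next page.
next_key = resp.get("LastEvaluatedKey")
```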
I'm afraid there is no simple solution to your problem. While this is super simple in SQL, DynamoDB sorting just doesn't work that way. Instead, I'll suggest a few ideas that may help get you unstuck:
Client Side Sorting - Use DDB to efficiently query the data your application needs, and let the client worry about sorting the data. For example, if your client is a web application, you could use javascript to dynamically sort the fields on the fly, depending on which field the user wants to sort by.
Consider using KSUIDs for your IDs - I noticed most of your access patterns involve sorting by CreationDate. A KSUID (K-Sortable Unique Identifier) is a globally unique ID that is sortable by generation time. It's a great option when your application needs to create unique IDs and sort by a creation timestamp. If you build a KSUID into your sort keys, your query results could automatically support sorting by creation date.
Reorganize Your Data - If you have the flexibility to redesign how you store your data, you could accommodate several of your access patterns with fewer secondary indexes (example below).
Finally, I notice that your table example is very "flat" and doesn't appear to model the relationships in a way that supports any of your access patterns (without adding indexes). Perhaps it's just an example data set to highlight your question about sorting, but I wanted to show a different way to model your data in case you are unfamiliar with these patterns.
For example, consider your access patterns that require you to fetch a patient's claims and payments, sorted by creation date. Here's one way that could be modeled:
This design handles four access patterns:
get patient claims, sorted by date created.
get patient payments, sorted by date created.
get patient info (names, etc...)
get patient claims, payments and info (in a single query).
The queries would look like this (in pseudocode):
query where PK = "PATIENT#UUID1" and SK < "PATIENT#UUID1"
query where PK = "PATIENT#UUID1" and SK > "PATIENT#UUID1"
query where PK = "PATIENT#UUID1" and SK = "PATIENT#UUID1"
query where PK = "PATIENT#UUID1"
These queries take advantage of the sort keys being lexicographically sorted. When you ask DDB to fetch the PATIENT#UUID1 partition with a sort key less than "PATIENT#UUID1", it will return only the CLAIM items. This is because CLAIM comes before PATIENT when sorted alphabetically. The same pattern is how I access the PAYMENT items for the given patient. I've used KSUIDs in this scenario, which gives you the added feature of having the CLAIM and PAYMENT items sorted by creation date!
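In boto3, those four queries would look roughly like this (the table name is an assumption; PK/SK and the PATIENT#UUID1 placeholder come from the pseudocode above):

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("PatientData")  # assumed table name
pk = "PATIENT#UUID1"

# CLAIM# sorts before PATIENT#, so "less than" returns only claims.
claims = table.query(
    KeyConditionExpression=Key("PK").eq(pk) & Key("SK").lt(pk)
)["Items"]

# PAYMENT# sorts after PATIENT#, so "greater than" returns only payments.
payments = table.query(
    KeyConditionExpression=Key("PK").eq(pk) & Key("SK").gt(pk)
)["Items"]

# The patient's own item has SK equal to its PK.
patient = table.query(
    KeyConditionExpression=Key("PK").eq(pk) & Key("SK").eq(pk)
)["Items"]

# Everything in the partition: claims, patient info, and payments.
everything = table.query(
    KeyConditionExpression=Key("PK").eq(pk)
)["Items"]
```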
While this pattern may not solve all of your sorting problems, I hope it gives you some ideas of how you can model your data to support a variety of access patterns with sorting functionality as a side effect.
I am working on a solution that uses a NoSQL backend. My experience is traditionally with relational databases, and I would like to discuss the best way to store a list of values which may appear in a drop-down in the UI. Traditionally, I would just create a table in my relational DB to store that small set of values, and my records would then reference a specific id representing that value. A simple example of this is a Person table with all of my person records and a Hair Color list-of-values table with all the possible hair colors. For each person, a hair color id from that list-of-values table would be stored in the person record. So a traditional foreign key relationship.
Most of these drop-downs are not huge; they are small sets (tens of values), so storing them in their own container within Cosmos seems like overkill. I thought I could also define these values as constants in my API model and manage them that way. However, if those values change, I need a new build of the API in order to expose them.
Any thoughts on best practices for how to handle this in the NoSQL space? Create a container in the NoSQL backend with the list of values, store the values as constants within my API model, or something else?
Appreciate your time considering this question.
In these scenarios, for reference data that backs UI elements, I typically recommend storing all of this data in a single Cosmos container. Cosmos is schema agnostic, so you can mix and match different schemas of data. If the data is < 10 GB, use a dummy partition key (i.e. /pk) with a single value, and use a "type" property to distinguish among the different entity types that match your UI elements. Fetch the data with a single query on the pk, then deserialize it into POCOs (or whatever hydrates your UI), using the type property to distinguish the different UI elements.
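A minimal sketch of that fetch with the Python SDK; the endpoint, key, database/container names, the "reference" pk value, and the example type names are all placeholders:

```python
from azure.cosmos import CosmosClient

# Placeholder endpoint, key, and names -- adjust to your account and schema.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("app-db").get_container_client("reference-data")

# One query pulls back every reference item under the dummy partition key;
# the "type" property says which drop-down each item belongs to.
items = container.query_items(
    query="SELECT * FROM c WHERE c.pk = @pk",
    parameters=[{"name": "@pk", "value": "reference"}],
    partition_key="reference",
)

by_type: dict[str, list] = {}
for item in items:
    by_type.setdefault(item["type"], []).append(item)
# e.g. by_type["hairColor"], by_type["country"], ... hydrate the UI drop-downs.
```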
You can store this in a container that is part of a shared database. Minimum RU would be 100 RU/s with four containers in a database or 400 RU/s for dedicated container throughput. Which one you choose will depend on how much RU/s the query that fetches this data costs. You can easily get that by running the query in the Azure portal and looking at the query stats.
I am planning to create a merchant table, which will have store locations of the merchant. Most merchants are small businesses and they only have a few stores. However, there is the odd multi-chain/franchise who may have hundreds of locations.
What would be my solution if I want to include location attributes within the merchant table? If I have to split it into multiple tables, how do I achieve that?
Thank you!
EDIT: How about splitting the table? To cater for the majority, say up to 5 locations, I can place them inside the same table. Beyond 5, it would spill over to a normalised table, with an indicator on the main table to say there are more than 5 locations. Any thoughts on how to achieve that?
You have a couple of options depending on your access patterns:
Compress the data and store the binary object in DynamoDB.
Store basic details in DynamoDB along with a link to S3 for the larger things. There's no transactional support across DynamoDB and S3 so there's a chance your data could become inconsistent.
Rather than embed location attributes, you could normalise your tables and put that data in a separate table with the equivalent of a foreign key to your merchant table. But, you may then need two queries to retrieve data for each merchant, which would count towards your throughput costs.
Catering for a spill-over table would have to be handled in the application code rather than at the database level: if (store_count > 5), execute another query to retrieve more data (rough sketch below).
If you don't need the performance and scalability of DynamoDB, perhaps RDS is a better solution.
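Going back to the spill-over idea, here's a rough sketch of how that check might look in application code; the table names and attribute names are assumptions for illustration:

```python
import boto3
from boto3.dynamodb.conditions import Key

# Assumed names: a "Merchant" table with up to 5 embedded locations and a
# store_count attribute, plus a normalised overflow table keyed on merchantId.
ddb = boto3.resource("dynamodb")
merchants = ddb.Table("Merchant")
overflow = ddb.Table("MerchantLocationsOverflow")

def get_locations(merchant_id: str) -> list:
    merchant = merchants.get_item(Key={"merchantId": merchant_id})["Item"]
    locations = merchant.get("locations", [])  # the embedded (<= 5) locations
    if merchant.get("store_count", 0) > 5:
        # Spill-over: the remaining locations live in the separate table.
        resp = overflow.query(
            KeyConditionExpression=Key("merchantId").eq(merchant_id)
        )
        locations += resp["Items"]
    return locations
```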
A bit late to the party, but I believe the right schema would be to have the partition key as merchantId and the sort key as storeId. This would create individual, separate records for each store, and you can store the geo location on each. This way (sketch below):
You would not cross the 400 KB item size limit
Queries become efficient if you want to fetch the location for just 1 of the stores of the merchant. If you want to fetch all the stores, there is no impact with this schema.
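For illustration, roughly what that key schema looks like with boto3; the table name and the example IDs are assumptions:

```python
import boto3
from boto3.dynamodb.conditions import Key

# Assumed table "Merchants" with partition key merchantId and sort key storeId.
table = boto3.resource("dynamodb").Table("Merchants")

# Fetch the location of a single store for a merchant (one point read).
one_store = table.get_item(
    Key={"merchantId": "MERCHANT#123", "storeId": "STORE#42"}
).get("Item")

# Fetch every store for the merchant -- same table, same key schema.
all_stores = table.query(
    KeyConditionExpression=Key("merchantId").eq("MERCHANT#123")
)["Items"]
```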
PS: I am a Software Engineer working on Amazon DynamoDB.
Let's say I have a table in DynamoDB called visits, which represents website visits, and one of the columns is the location.
In an RDBMS I would have:
visits [id, website_id, ........, location_id ]
ref_locations [id, city, country, postcode, lat, long]
The query we want to do is: get all the visits for this website (so querying by website id is fine), but I need the location information per visit, like the city, the country, etc. In SQL this is done with a simple join.
What about DynamoDB? I'm thinking we could store the location as a document in the table (hence denormalizing it completely), but I'm sure this isn't the right way.
What do you guys do in this situation?
Thanks
Denormalization is one viable approach. An alternative is to persist the reference table in Dynamo and then cache it, either in a local data structure (e.g. a Java/C#/Python/whatever Map) or in an in-memory key-value store (e.g. Redis). Denormalization is preferable if the reference data is small and (almost) completely static, since updates to denormalized data are extremely expensive. Caching is preferable if the reference data is moderately large and/or may be updated; in that case I recommend a shared cache such as Redis rather than a per-server data structure, as this makes it easier to invalidate or update the cache. (If the reference data is large, you're probably best off just doing a second Dynamo fetch for it, but it doesn't sound like this is the case for your data.)
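For example, a rough sketch of the caching approach; the table name, key names, TTL, and Redis setup here are assumptions, not part of your schema:

```python
import json

import boto3
import redis

# Assumed names: a "ref_locations" table keyed on "id", and a local Redis.
locations_table = boto3.resource("dynamodb").Table("ref_locations")
cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

def get_location(location_id: str) -> dict:
    """Look up a location, serving from Redis when possible."""
    cached = cache.get(f"location:{location_id}")
    if cached:
        return json.loads(cached)
    item = locations_table.get_item(Key={"id": location_id}).get("Item", {})
    # Cache with a TTL so updates to the reference data eventually propagate.
    cache.setex(f"location:{location_id}", 3600, json.dumps(item, default=str))
    return item
```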
Regardless of which approach you choose, I suggest benchmarking both storing the reference data as structured data and storing it as a compressed binary - in my experience the reduced storage and network costs of compression are often worth the CPU cost of a g(un)zip (however, my experience has primarily been with caching JSON or XML, which compress well).
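A quick way to eyeball that trade-off (purely illustrative data; very small payloads may not benefit from compression at all):

```python
import gzip
import json

location = {"id": "LOC#1", "city": "London", "country": "GB", "postcode": "SW1A 1AA"}
raw = json.dumps(location).encode("utf-8")
packed = gzip.compress(raw)

# Compare sizes; gzip usually wins on larger JSON/XML blobs.
print(f"raw: {len(raw)} bytes, gzipped: {len(packed)} bytes")
```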