DynamoDB Modeling Multiple Query Elements - amazon-dynamodb

Background: I have a relational db background and have never built anything for DynamoDB that wasn't just used for fast writes with very few reads. I am trying to learn DynamoDB patterns by migrating one of my help desk apps from MySQL to DynamoDB.
The application is a fairly simple one from a data storage perspective. A user submits a request and that request generates 1 or more tickets.
Setup: I have screens where people see initial requests and that request's tickets and search views that allow support to query on a bunch of attributes of a ticket (last name of user, status of ticket, use case of ticket, phone number of user, dept of user). This design in a SQL db is pretty straightforward but in Dynamo, I'm really being thrown for a loop on how to structure primary/sort keys and secondary indexes (if necessary).
I created one collection for requests and one collection for tickets. The individual requests have an array of ticket ids that belong to it. The ticket item has an attribute that stores the request id so that I can search that way. But what I am hung up on, is how do I incorporate searching on a ticket/request's attributes without having to do a full scan?
I read about composite keys and perhaps creating a composite sort key similar to: ## so that I can search on each of those fields directly without having to know the primary key (ticket id).
Question: How do you design dynamo collections/tables that require querying a lot of different attribute values without relying on a primary key?

This is typically something that DynamoDB is not good at, not to say it definitely cannot be done. The strength and speed for DynamoDB comes from having well known access patterns and designing your schema for these patterns. In general if you don't know what your users will search for, or there are many different possible queries, it's better to look at something like RDS or a native SQL DB. That being said a possible direction to solve this could be to create multiple lists for each of the fields and duplicate the data. This could all be done in the same table.

Related

Are multiple dynamoDB queries in a single API request bad practice

I'm trying to create my first DynamoDB based project and I'm having some trouble figuring out the best practices working with a NoSQL database.
My usecase currently is storing users and teams. I have a table that has a partition key of either USER#{userId} or TEAM{#teamId}. If the PK is TEAM{#teamId} I store records with SK either TEAM#{teamId} for team details, or USER#{userId} for the user's details in the team (acceptedInvite, joinDate etc). I also have a GSI based on the userId/email column that allows me to query all the teams a user has been invted to, or the user's team, depending on the value of acceptedInvite field. Attached screenshots of the table structure at the moment:
The table
The GSI
In my application I have an access pattern of getting a team's team members, given a user id.
Currently, I'm doing two queries in my lambda function:
Get user's team, by querying the GSI on PK = {userId} and fitler acceptedInvite = true
Get the team data by querying the table on PK = {teamId} and SK begins_with USER#
This works fine, but I'm concerned I need to preform two separate DynamoDB calls in my API function.
I'm wondering if there's a better way to represent this access pattern and if multiple dynamoDB calls are actually that bad, since I cannot see another way to do this.
Any kind of feedback is appreciated!
The best way to avoid making two queries like this is to supply the API caller with all the information needed to make a single DynamoDB request. For your case this means supplying the caller with the teamId. You can do this as either as part of a list operation response, or if it is the authenticated user, then as part of their claims in a JWT.

Can we avoid scan in dynamodb

I am new the noSQL data modelling so please excuse me if my question is trivial. One advise I found in dynamodb is always supply 'PartitionId' while querying otherwise, it will scan the whole table. But there could be cases where we need listing our items, for instance in case of ecom website, where we need to list our products on list page (with pagination).
How should we perform this listing by avoiding scan or using is efficiently?
Basically, there are three ways of reading data from DynamoDB:
GetItem – Retrieves a single item from a table. This is the most efficient way to read a single item, because it provides direct access to the physical location of the item.
Query – Retrieves all of the items that have a specific partition key. Within those items, you can apply a condition to the sort key and retrieve only a subset of the data. Query provides quick, efficient access to the partitions where the data is stored.
Scan – Retrieves all of the items in the specified table. (This operation should not be used with large tables, because it can consume large amounts of system resources.
And that's it. As you see, you should always prefer GetItem (BatchGetItem) to Query, and Query — to Scan.
You could use queries if you add a sort key to your data. I.e. you can use category as a hash key and product name as a sort key, so that the page showing items for a particular category could use querying by that category and product name. But that design is fragile, as you may need other keys for other pages, for example, you may need a vendor + price query if the user looks for a particular mobile phones. Indexes can help here, but they come with their own tradeofs and limitations.
Moreover, filtering by arbitrary expressions is applied after the query / scan operation completes but before you get the results, so you're charged for the whole query / scan. It's literally like filtering the data yourself in the application and not on the database side.
I would say that DynamoDB just is not intended for many kinds of workloads. Probably, it's not suited for your case too. Think of it as of a rich key-value (key to object) store, and not a "classic" RDBMS where indexes come at a lower cost and with less limitations and who provide developers rich querying capabilities.
There is a good article describing potential issues with DynamoDB, take a look. It contains an awesome decision tree that guides you through the DynamoDB argumentation. I'm pasting it here, but please note, that the original author is Forrest Brazeal.
Another article worth reading.
Finally, check out this short answer on SO about DynamoDB usecases and issues.
P.S. There is nothing criminal in doing scans (and I actually do them by schedule once per day in one of my projects), but that's an exceptional case and I regret about the decision to use DynamoDB in that case. It's not efficient in terms of speed, money, support and "dirtiness". I had to increase the capacity before the job and reduce it after, but that's another story…

Lookup the existence of a large number of keys (up to1M) in datastore

We have a table with 100M rows in google cloud datastore. What is the most efficient way to look up the existence of a large number of keys (500K-1M)?
For context, a use case could be that we have a big content datastore (think of all webpages in a domain). This datastore contains pre-crawled content and metadata for each document. Each document, however, could be liked by many users. Now when we have a new user and he/she says he/she likes document {a1, a2, ..., an}, we want to tell if all these document ak {k in 1 to n} are already crawled. That's the reason we want to do the lookup mentioned above. If there is a subset of documents that we don't have yet, we would start to crawl them immediately. Yes, the ultimate goal is to retrieve all these document content and use them to build the user profile.
My current thought is to issue a bunch of batch lookup requests. Each lookup request can contain up to 1K of keys [1]. However to get the existence of every key in a set of 1M, I still need to issue 1000 requests.
An alternative is to use a customized middle layer to provide a quick look up (for example, can use bloom filter or something similar) to save the time between multiple requests. Assuming we never delete keys, every time we insert a key, we add it through the middle layer. The bloom-filter keeps track of what keys we have (with a tolerable false positive rate). Since this is a custom layer, we could provide a micro-service without a limit. Say we could respond to a request asking for the existence of 1M keys. However, this definitely increases our design/implementation complexity.
Is there any more efficient ways to do that? Maybe a better design? Thanks!
[1] https://cloud.google.com/datastore/docs/concepts/limits
I'd suggest breaking down the problem in a more scalable (and less costly) approach.
In the use case you mentioned you can deal with one document at a time, each document having a corresponding entity in the datastore.
The webpage URL uniquely identifies the page, so you can use it to generate a unique key/identifier for the respective entity. With a single key lookup (strongly consistent) you can then determine if the entity exists or not, i.e. if the webpage has already been considered for crawling. If it hasn't then a new entity is created and a crawling job is launched for it.
The length of the entity key can be an issue, see How long (max characters) can a datastore entity key_name be? Is it bad to haver very long key_names?. To avoid it you can have the URL stored as a property of the webpage entity. You'll then have to query for the entity by the url property to determine if the webpage has already been considered for crawling. This is just eventually consistent, meaning that it may take a while from when the document entity is created (and its crawling job launched) until it appears in the query result. Not a big deal, it can be addressed by a bit of logic in the crawling job to prevent and/or remove document duplicates.
I'd keep the "like" information as small entities mapping a document to a user, separated from the document and from the user entities, to prevent the drawbacks of maintaining possibly very long lists in a single entity, see Manage nested list of entities within entities in Google Cloud Datastore and Creating your own activity logging in GAE/P.
When a user likes a webpage with a particular URL you just have to check if the matching document entity exists:
if it does just create the like mapping entity
if it doesn't and you used the above-mentioned unique key identifiers:
create the document entity and launch its crawling job
create the like mapping entity
otherwise:
launch the crawling job which creates the document entity taking care of deduplication
launch a delayed job to create the mapping entity later, when the (unique) document entity becomes available. Possibly chained off the crawling job. Some retry logic may be needed.
Checking if a user liked a particular document becomes a simple query for one such mapping entity (with a bit of care as it's also eventually consistent).
With such scheme in place you no longer have to make those massive lookups, you only do one at a time - which is OK, a user liking documents one a time is IMHO more natural than providing a large list of liked documents.

Is there a way to enforce a schema constraint on an AWS DynamoDB table?

I'm a MSSQL developer who recently was tasked with building a new application using DynamoDB since we use AWS and we wanted a highly scaleable database service.
My biggest concern is data integrity. For example, I have a table for all my users where every row needs to have a username, email, and name field, all strings, with a verified field that's an int. Is there anyway to require all entries in that table to have those fields and to be of that particular type?
Since the application is in PHP I'm using Kettle as my ORM which should prevent me from messing up the data integrity but another developer voiced a concern about if we ever add another application or if someone manually changes some types via the console.
https://github.com/inouet/kettle
Currently, no, you are responsible for maintaining the integrity of your items with respect to the existence of attributes that are not keys on the base table. However, you can use LSI and GSI to enforce data types of attributes (notwithstanding my qualm that this is not a recommended pattern, as it could cause partition heat especially for attributes whose range of values is small). For example, verified seems like it might take only 0 or 1 as a value, so if you create a GSI with PK=verified where verified is a Number, writes to the base table may get throttled by the verified GSI.

sql server database design

I am planning to create a website using ASP.NET and SQL Server. However, my plan for the database design leaves me wondering if there is a better way.
The website will serve as a repository of information for various users. I figure I would have two databases, a Membership and Profile database.
The profile database would contain user data for all users, where each user may have ~20 tables. I would create the tables when the user account is created and generate a key used to name the tables. The tables are not directly related.
For Example a set of tables for two different users could look like:
User1 Tables - TransactionTable_Key1, AssetTable_Key1, ResearchTable_Key1 ....;
User2 Tables - TransactionTable_Key2, AssetTable_Key2, ResearchTable_Key2 ....;
The Key1, Key2 etc.. values would be retrieved based on the MembershipID data when the account was created. This could result in a very large number of tables over time. I'm not sure if this will limit scalability by setting up the database in this way. Any recommendations?
Edit: I should mention that some of these tables would contain 20k+ rows.
Realistically it sounds like you only really need one database for this.
From the way you worded your question, it sounds like you're trying to dynamically create tables for users as they create accounts. I wouldn't recommend this method.
What you want to do is create a master table that contains a primary key for each individual user. I'm assuming this is the Membership table. Then create the ~20 tables that you need for the profiles of these members. Every record, no matter the number of users that you have, will go into these tables. These 20 tables would need to have a foreign key pointing to the unique identifier of the Membership table.
When you want to query a Member for their user information, just select from the tables where the membership table's primary Id matches the foreign key in the profile tables.
This would result in only a few tables in the end and is easily maintainable and follows better database design.
Your ORM layer (EF, LINQ, DAL code) will hate having to deal with one set of tables per tenant. It is much better to have either one set of tables for all tenant in a single database, or a separate database per tenant. The later is only better if schema upgrade has to be vetted by tenant (like Salesforce.com has). If you can afford to upgrade all tenant to a new schema at once then there is no reason for database per tenant.
When you design a schema that hold multiple tenant the important things to remember are
don't use heaps, all tables must be clustered index
add the tenant ID as the leftmost key to every clustered
add the tenant ID as the leftmost key to every non-clustered index too
add the Left.tenantID = right.tenantID predicate to every join
add the table.TenantID = #currentTenantID to every query
These are fairly simple rules and if you obey them (with no exceptions) you will get a perfect partitioning per tenant of every query (no query will ever ever scan rows in a range of a different tenant) so you eliminate contention between tenants. To be more through, you can disable lock escalation to make sure no tenant escalates to block every other tenant.
This design also lends itself to table partitioning and to sharing the database for scale-out.
You definitely don't want to create a set of tables for each user, and you would want these only in one database. Even with SQL Server 2008's large capacity for tables (note really total objects in database), it would quickly become unmanageable. Your best bet is to use 20 tables, and separate them via a column into user areas. You might consider partitioning the tables by this user value, but that should be tested for performance reasons too.
Yes, since the tables only contain id, key, and value, why not make one single table?
Have the columns:
id, user ID, key, value
Put an Index on the user ID field.
A key idea behind a relational database is that the table structure does not change. You create a solid set of tables, and these are the "bones" of your application.
Cheers,
Daniel
Neal,
The solution really depends on your requirement. If security and data access are concern and you have only a handful of users, you can set up a different db for each user with access for him set to only his/her database.
Other wise, what Daniel Williams suggested is a good alternative where you have one DB and tables laid out with a indexed column partitioning the users data rows.
It's hard to tell from the summary, but it looks like you are designing for dynamic attribution by user. This design approach is called EAV (Entity-Attribute-Value) and consists of a simple base collection key (UserID, SiteID, ProductID...) and then rows consisting of name/value pairs. In a more complex version, categories are sometimes added as "super columns" to the tuple/row and provide sub-groupings for a set of name/value pairs.
Designing in this way moves responsibility for data type integrity, relational integrity and tuple integrity to the application layer.
The risk with doing this in a relational system involves the breaking of the tuple or row into a set of rows. Updates, deletes, missing values and the definition of a tuple are no longer easily accessible through human interaction. As your application evolves and the definition of a tuple changes, it becomes almost impossible to tell if a name/value pair is missing because it's part of an earlier-version tuple or because it was unintentionally deleted. Ad-hoc research as well becomes harder to manage as business analysts must keep an understanding of the virtual structure either in their heads or in documentation provided.
If you are looking to implement an EAV model, I would suggest you look at a non-relational solution (nosql) like MongoDB or CouchDB. These stores allow a developer to save and retrieve "documents" or json-formatted messages that are essentially made up of a collection of name/value pairs and can look very much like a serialized object. The advantage here is that you can store dynamic attribution without breaking your tuple. You always know that you have a complete tuple because you can store and retrieve it as a single "blob" of information that can be serialized and deserialized at-will. You can also update single attributes within the tuple, if that's a concern.
MongoDB also provides some database-like features such as multiple-attribute indexes, a query engine that is robust in comparison to other similar non-relational offerings and a sharding solution that is much less trouble than trying to do it with MySQL.
I hope this helps.

Resources