I have an application on AWS using DynamoDB where users send messages to each other. I am not familiar with AWS and I am lacking best-practice knowledge.
My application has now started to get slow when retrieving messages for a user, because there is more and more data in my database.
I am thinking that it is because of my primary key, and I wonder what a good primary key would be in this case.
Currently I am using a random GUID as the primary key.
To retrieve all the messages for a user, I am doing a scan operation.
I am considering a composite value based on the username as the primary key, but I wonder whether it would really be better. For instance, if I need to read a user's message count and increment it just to build the primary key, the request will probably take even longer.
What would be a good primary key here?
Thanks!
It will be better since it appears you often query based on the userid. Scans are expensive and should be avoided where possible. AWS has a great article on best practices for choosing a partition key (primary key). The key takeaway is the following:
You should evaluate various approaches based on your data ingestion and access pattern, then choose the most appropriate key with the least probability of hitting throttling issues.
Using a GUID for the partition/primary key is a waste if you never query the data by it. Since the Query operation (unlike Scan) requires the partition key (and optionally the sort key), you want to choose a value that you frequently use to retrieve the data and that has sufficient cardinality to distribute your data across a reasonable number of partitions.
What other access patterns do you have in your application? From what you've mentioned so far, userid seems to be a reasonable choice.
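For example, assuming a table whose partition key is userId and whose sort key is a message timestamp (all names here are illustrative, not from your setup), a minimal boto3 sketch of retrieving and counting one user's messages with Query instead of Scan might look like this:

```python
import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical table "Messages": partition key "userId", sort key "sentAt".
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Messages")

# Fetch all messages for one user. Only that user's item collection is
# read, unlike a Scan, which reads the entire table.
resp = table.query(KeyConditionExpression=Key("userId").eq("alice"))
messages = resp["Items"]

# Count a user's messages without fetching them, so there is no need to
# maintain a separate counter just to build the key.
count = table.query(
    KeyConditionExpression=Key("userId").eq("alice"),
    Select="COUNT",
)["Count"]
```

(Pagination via LastEvaluatedKey is omitted for brevity.)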
Related
Background: I have a relational db background and have never built anything for DynamoDB that wasn't just used for fast writes with very few reads. I am trying to learn DynamoDB patterns by migrating one of my help desk apps from MySQL to DynamoDB.
The application is a fairly simple one from a data storage perspective. A user submits a request and that request generates 1 or more tickets.
Setup: I have screens where people see initial requests and each request's tickets, and search views that allow support to query on a bunch of attributes of a ticket (last name of user, status of ticket, use case of ticket, phone number of user, dept of user). This design is pretty straightforward in a SQL db, but in Dynamo I'm really being thrown for a loop on how to structure primary/sort keys and secondary indexes (if necessary).
I created one collection for requests and one collection for tickets. Each request has an array of the ticket ids that belong to it, and each ticket item has an attribute that stores the request id so that I can search that way. What I am hung up on is how to support searching on a ticket/request's attributes without having to do a full scan.
I read about composite keys, and perhaps creating a composite sort key that concatenates some of those fields (e.g. joined with # separators) so that I can search on each of those fields directly without having to know the primary key (ticket id).
Question: How do you design dynamo collections/tables that require querying a lot of different attribute values without relying on a primary key?
This is typically something that DynamoDB is not good at, though that is not to say it cannot be done. DynamoDB's strength and speed come from having well-known access patterns and designing your schema for those patterns. In general, if you don't know what your users will search for, or there are many different possible queries, it's better to look at something like RDS or a native SQL DB. That being said, one possible direction is to create multiple item lists, one per searchable field, and duplicate the data; this can all be done in the same table.
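As a rough sketch of that idea (the table name, the generic pk/sk attribute names, and the ticket fields are all hypothetical), each ticket is written once under its own key and again under one key per searchable field, so any of those fields can be queried directly in the same table:

```python
import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical single-table layout with generic "pk" and "sk" attributes.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("HelpDesk")

ticket = {"ticketId": "t-100", "lastName": "smith", "status": "open", "dept": "sales"}

with table.batch_writer() as batch:
    # Canonical copy of the ticket.
    batch.put_item(Item={"pk": "TICKET#t-100", "sk": "TICKET#t-100", **ticket})
    # One duplicate per field that support needs to search on.
    for field in ("lastName", "status", "dept"):
        batch.put_item(
            Item={"pk": f"{field.upper()}#{ticket[field]}", "sk": "TICKET#t-100", **ticket}
        )

# All open tickets, without a Scan.
open_tickets = table.query(KeyConditionExpression=Key("pk").eq("STATUS#open"))["Items"]
```

The price is write amplification: every update has to touch every duplicate, which is the trade-off this approach makes for query speed.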
I'm new to "DynamoDB" and wanting to know best practice to maintaining unique partition key value when you add records to a table.
With my existing experience related to SQL, primary keys are normally maintained by the system with identity columns or via a trigger. I've searched through various forums and "AWS" documentation, but did not find any specifics. Do you manually determine the existence of partition key value or am I missing something obvious?
In DynamoDB, querying flexibility is limited compared to SQL, so the schema, as well as the partition key / sort key, should be designed to make the most common and important queries as fast as possible. You can find some generic best practices here:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/best-practices.html
https://aws.amazon.com/blogs/database/choosing-the-right-dynamodb-partition-key/
If you can provide more context on the use case for which you are trying to use DynamoDB, you should get more pointed answers.
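Regarding the uniqueness question itself: DynamoDB does not generate keys for you the way an identity column does; the client supplies the key. One common pattern is to generate the key client-side and use a conditional write, so that an accidental duplicate fails instead of silently overwriting the existing item. A minimal boto3 sketch (table and attribute names are illustrative):

```python
import uuid
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Records")  # hypothetical table with partition key "id"

try:
    table.put_item(
        Item={"id": str(uuid.uuid4()), "payload": "example"},
        # Rejects the write if an item with this partition key already exists.
        ConditionExpression="attribute_not_exists(id)",
    )
except ClientError as e:
    if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
        print("key already exists; generate a new one and retry")
    else:
        raise
```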
I'm an MSSQL developer who was recently tasked with building a new application using DynamoDB, since we use AWS and we wanted a highly scalable database service.
My biggest concern is data integrity. For example, I have a table for all my users where every row needs to have a username, email, and name field, all strings, with a verified field that's an int. Is there any way to require all entries in that table to have those fields and to be of that particular type?
Since the application is in PHP I'm using Kettle as my ORM, which should prevent me from messing up the data integrity, but another developer voiced a concern about what happens if we ever add another application or if someone manually changes some types via the console.
https://github.com/inouet/kettle
Currently, no: you are responsible for maintaining the integrity of your items with respect to the existence of non-key attributes on the base table. However, you can use an LSI or GSI to enforce the data types of attributes, since an indexed attribute must match the type declared for the index (notwithstanding my qualm that this is not a recommended pattern, as it could cause partition heat, especially for attributes whose range of values is small). For example, verified seems like it might take only 0 or 1 as a value, so if you create a GSI with PK=verified where verified is a Number, writes to the base table may get throttled by the verified GSI.
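Since the base table will not enforce a schema for you, one option is to validate items in application code before every write. A minimal hand-rolled sketch in Python (a real app might lean on the ORM or a JSON Schema validator instead; the table and field names are illustrative):

```python
import boto3

# Required fields and their expected Python types for a user item.
REQUIRED = {"username": str, "email": str, "name": str, "verified": int}

def validate_user(item: dict) -> None:
    for field, expected in REQUIRED.items():
        if field not in item:
            raise ValueError(f"missing required field: {field}")
        if not isinstance(item[field], expected):
            raise ValueError(f"{field} must be of type {expected.__name__}")

table = boto3.resource("dynamodb").Table("Users")  # hypothetical table name

user = {"username": "jdoe", "email": "jdoe@example.com", "name": "J. Doe", "verified": 0}
validate_user(user)  # raises before the write if the item is malformed
table.put_item(Item=user)
```

Of course this only guards writes that go through your code; it does nothing about edits made directly in the console, which is exactly the concern your colleague raised.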
We are looking to use AWS DynamoDB for storing application logs. Logs from multiple components in our system would be stored here. We are expecting a lot of writes and only a minimal number of reads.
The client that we use for writing into DynamoDB generates a UUID for the partition key, but using this makes it difficult to actually search.
Most prominent search cases are,
Search based on Component / Date / Date time
Search based on JobId / File name
Search based on Log Level
From what I have read so far, using a UUID for the partition key is not suitable for our case. I am currently thinking about using either Component/Date or JobId/Filename as our partition key and an ISO 8601 timestamp as our sort key. Does this sound like a reasonable / widely used setup for such a use case?
If not, kindly suggest alternatives that can be used.
Using a UUID as the partition key will distribute the data evenly among internal partitions, so you will be able to utilize all of the provisioned capacity.
Using a sortable (ISO 8601) timestamp as the range/sort key will store the data in order, so it will be possible to retrieve it in order.
However, for retrieving logs by anything other than the timestamp, you may have to create global secondary indexes (GSIs), which are charged separately.
Hope your logs are precious enough to store in DynamoDB instead of CloudWatch ;)
In general, DynamoDB seems like a bad solution for storing logs:
It is more expensive than CloudWatch.
It has poor querying capabilities, unless you start utilising global secondary indexes, which will double or triple your expenses.
Unless you use a random UUID for the hash key, you risk creating hot partitions/keys in your db (for example, using a component ID as a primary or global secondary key might result in throttling if some components write much more often than others).
But assuming you already know these drawbacks and you still want to use DynamoDB, here is what I would recommend (a concrete sketch follows the list):
Use JobId or Component name as hash key (one as primary, one as GSI)
Use timestamp as a sort key
If you need to search by log level often, you can create a local secondary index, or you can combine the level and timestamp into a single sort key (as in the sketch below). If you mostly only care about searching for ERROR-level logs, it might be better to create a sparse GSI for that.
Create a new table each day (let's call it the "hot table"), and only store that day's logs in it. This table will have high write throughput. Once the day finishes, reduce its write throughput significantly (maybe to 0) and leave only some read capacity. This way you will reduce the risk of running into the 10 GB per hash key limit that DynamoDB has.
This approach also has an advantage in terms of log retention: it is very easy and cheap to remove logs older than X days this way. By keeping the old tables' capacity very low, you will also avoid very high costs. For more complicated ad-hoc analysis, use EMR.
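Here is a minimal boto3 sketch of the layout recommended above, with JobId as the hash key and a combined level#timestamp sort key (the table and attribute names are illustrative):

```python
import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical daily "hot table"; hash key "jobId", sort key "levelTs"
# combining the log level and an ISO 8601 timestamp.
table = boto3.resource("dynamodb").Table("logs-2019-01-15")

table.put_item(
    Item={
        "jobId": "job-42",
        "levelTs": "ERROR#2019-01-15T10:23:01Z",
        "component": "ingest-worker",
        "message": "failed to parse file",
    }
)

# All ERROR logs for one job, in time order, without a Scan.
errors = table.query(
    KeyConditionExpression=Key("jobId").eq("job-42")
    & Key("levelTs").begins_with("ERROR#")
)["Items"]
```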
Why does aspnet_users use a GUID for the id rather than an incrementing int?
Also, is there any reason not to use this in other tables as the primary key? It feels a bit odd, as most apps I've worked with in the past just use the normal int system.
I'm also about to start using this id to match against an extended-details table for extra user prefs etc. I considered using a link table with a GUID and an int in it, but decided against it, as I don't think I actually need to expose the user id as a public int.
Although I would like to have the int (it feels easier to do a user lookup, e.g. stackoverflow.com/users/12345/user-name), since I am just going to have the username I don't think I need to carry this item around and incur the extra complexity of lookups whenever I need to find a user's int.
Thanks for any help with any of this.
It ensures uniqueness across disconnected systems. Any data store which may need to interface with another, previously unconnected, data store can potentially encounter collisions: e.g. if they both used ints to identify users, we now have to go through a complex resolution process to choose new IDs for the conflicting records and update all references accordingly.
The downside to using a standard uniqueidentifier in SQL Server (with newid()) as the primary key is that GUIDs are not sequential, so as new rows are created they are inserted at some arbitrary position in a physical database page instead of being appended to the end. This causes severe page fragmentation for systems with any substantial insert rate. It can be corrected by using newsequentialid() instead. I discussed this in more detail here.
In general, it's best practice to either use newsequentialid() for your GUID primary key, or just not use GUIDs for the primary key at all. You can always have a secondary indexed column that stores a GUID, which you can use to keep your references unique.
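As a sketch of the newsequentialid() approach (shown here issuing the DDL from Python via pyodbc; the connection string and table name are hypothetical):

```python
import pyodbc

# Hypothetical connection string; adjust driver/server/database to taste.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=AppDb;Trusted_Connection=yes;"
)
cursor = conn.cursor()

# NEWSEQUENTIALID() produces ever-increasing GUIDs, so new rows are
# appended at the end of the clustered index instead of splitting pages.
cursor.execute("""
    CREATE TABLE Users (
        Id UNIQUEIDENTIFIER NOT NULL DEFAULT NEWSEQUENTIALID() PRIMARY KEY,
        UserName NVARCHAR(256) NOT NULL
    )
""")
conn.commit()
```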
GUIDs as a primary key are quite popular with certain groups of programmers who don't really care (don't want to, or don't know enough to) about their underlying database.
GUIDs are cool because they're (almost) guaranteed to be unique, and you can create them "ahead of time" on the client app in .NET code.
Unfortunately, those folks aren't aware of the terrible downsides (horrendous index fragmentation and thus performance loss) of that choice when it comes to SQL Server. Lots of programmers really just don't care one bit... and then blame SQL Server for being as slow as a dog.
If you want to use GUIDs for your primary keys (and they do have some really good uses, as Rex M. pointed out, mostly in replication scenarios), then OK, but make sure to use an INT IDENTITY column as your clustering key in SQL Server to minimize index fragmentation and thus performance losses.
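A sketch of that arrangement (again via pyodbc, with hypothetical names): the GUID remains the primary key that other tables reference, but it is declared NONCLUSTERED, and a narrow INT IDENTITY column carries the clustered index instead:

```python
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=AppDb;Trusted_Connection=yes;"
)
cursor = conn.cursor()

# The GUID stays the primary key for references, while the ever-increasing
# INT IDENTITY keeps the physical row order from fragmenting on insert.
cursor.execute("""
    CREATE TABLE Users (
        ClusterId INT IDENTITY(1,1) NOT NULL,
        Id UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID(),
        UserName NVARCHAR(256) NOT NULL,
        CONSTRAINT PK_Users PRIMARY KEY NONCLUSTERED (Id)
    )
""")
cursor.execute("CREATE UNIQUE CLUSTERED INDEX CIX_Users_ClusterId ON Users (ClusterId)")
conn.commit()
```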
Kimberly Tripp, the "Queen of SQL Server Indexing", has a bunch of really good and insightful articles on the topic - see some of my favorites:
GUIDs as Primary and/or clustering key
The clustered index key debate continues....
The clustered index key debate....again!
Indexes in SQL Server 2005/2008 Best Practices
and basically anything she ever publishes on her blog is worth reading.