DynamoDB throttling even with spare capacity - amazon-dynamodb

I am getting throttled update requests on a DynamoDB table even though there is provisioned capacity to spare.
What could be causing this?
I have a hunch that this must be related to "hot keys" in the table, but I want to get an opinion before going down that rabbit-hole. If that is the problem, any suggestions on tools or processes to help visualize/debug the issue would be appreciated.

Frequent updates to the same hash key but different range key?
i.e. userId + timeStamp
userId = Hash Key
timeStamp = Range Key
e.g.
user1 + 2016-06-23:23:00:01
user1 + 2016-06-23:23:00:02
user1 + 2016-06-23:23:00:03
user1 + 2016-06-23:23:00:04
user1 + 2016-06-23:23:00:05
This causes hot keys. All items that share a hash key live in the same partition, and throughput is enforced per partition, so a single hot hash key can be throttled even while the table as a whole has capacity to spare.
There is no non-invasive technique that I know of. If you have access to the code base to make changes, I suggest logging the hash and range key of every write; that is one way to determine whether you have hot rows / hot keys.
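A minimal sketch of that logging idea, assuming boto3 and a hypothetical "UserEvents" table keyed on userId (hash) + timeStamp (range); the table and attribute names are made up, it is the per-hash-key counting that matters:

# Wrap your writes and count how many hit each hash key (illustrative sketch).
from collections import Counter
import boto3

dynamodb = boto3.client("dynamodb")
hash_key_counts = Counter()  # writes seen per hash key in this process

def tracked_update(user_id, timestamp, status):
    """Perform the update and record which hash key it hit."""
    hash_key_counts[user_id] += 1
    dynamodb.update_item(
        TableName="UserEvents",              # hypothetical table name
        Key={
            "userId": {"S": user_id},        # hash key
            "timeStamp": {"S": timestamp},   # range key
        },
        UpdateExpression="SET #s = :s",
        ExpressionAttributeNames={"#s": "status"},
        ExpressionAttributeValues={":s": {"S": status}},
    )

def report_hot_keys(top_n=10):
    # A heavily skewed output (one userId dominating) points at a hot key.
    for user_id, count in hash_key_counts.most_common(top_n):
        print(f"{user_id}: {count} updates")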

Related

How to fix ItemCollectionSizeLimitExceededException for this scenario?

This is a question in a certification exam which has me stumped: a DynamoDB table is getting requests from two gaming apps. App A is sending 500,000 requests per second and App B is sending 10,000 requests per second, and each request is 20 KB. Users are complaining about ItemCollectionSizeLimitExceededException.
Current design
Primary key: game_name, with the event identifier (uid) as the sort key
LSI: player_id and event_time
What would be the correct choice? It looks like the LSI is the problem here (item collections are capped at 10 GB on tables with an LSI), but I am not 100% certain.
Choice A: Use the player identifier as the partition key. Use the event time as the sort key. Add a global secondary index with the game name as the partition key and the event time as the sort key.
Choice B: Create one table for each game. Use the player identifier as the partition key. Use the event time as the sort key.
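For clarity, here is how I read Choice A as a table definition (a boto3 sketch; the table and index names are made up):

import boto3

dynamodb = boto3.client("dynamodb")
dynamodb.create_table(
    TableName="GameEvents",  # made-up name
    AttributeDefinitions=[
        {"AttributeName": "player_id", "AttributeType": "S"},
        {"AttributeName": "event_time", "AttributeType": "S"},
        {"AttributeName": "game_name", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "player_id", "KeyType": "HASH"},    # partition key
        {"AttributeName": "event_time", "KeyType": "RANGE"},  # sort key
    ],
    GlobalSecondaryIndexes=[{
        "IndexName": "game_name-event_time-index",
        "KeySchema": [
            {"AttributeName": "game_name", "KeyType": "HASH"},
            {"AttributeName": "event_time", "KeyType": "RANGE"},
        ],
        "Projection": {"ProjectionType": "ALL"},
    }],
    BillingMode="PAY_PER_REQUEST",
)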

Querying dynamo by time intervals

I'm new to DynamoDB and would like some help on how to best structure things, and whether it's the right tool for the job.
Let's say I have thousands of users signed up to receive messages. They can choose to receive messages every half hour, hour, couple of hours or every 4 hours. So essentially there is a schedule attribute for each user's message. Users can also specify a time window for when they receive these messages, e.g. 09:00 - 17:00 and also toggle an active state.
I want to be able to easily get the messages to send to the various users at the right time.
If done in SQL this would be really easy, with something like:
Select * from UserMessageSchedules
where
now() > startTime
and now() < endTime
and userIsActive
and schedule = 'hourly'
But I'm struggling to do something similar in DynamoDB. At first I thought I'd have the following schema:
userId (partition key)
messageId (sort key)
schedule (one of half_hour, hour, two_hours, four_hours)
startTime_userId
endTime
I'd create a Global Secondary Index with the 'schedule' attribute being the partition key, and startTime + userId being the sort key.
I could then easily query for messages that need sending after a startTime.
But I'd still have to check endTime > now() within my lambda. Also, I'd be reading in most of the table, which seems inefficient and may lead to throughput issues?
And with the limited number of schedules, would I get hot partitions on the GSI?
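For reference, I imagine the query against that GSI would look roughly like this (boto3; the index name is made up, and the endTime check ends up as a filter, which doesn't reduce the read capacity consumed):

from datetime import datetime, timezone
import boto3
from boto3.dynamodb.conditions import Attr, Key

table = boto3.resource("dynamodb").Table("UserMessageSchedules")
now = datetime.now(timezone.utc).strftime("%H:%M")

resp = table.query(
    IndexName="schedule-startTime_userId-index",   # made-up GSI name
    KeyConditionExpression=Key("schedule").eq("hourly")
        & Key("startTime_userId").lt(now + "~"),   # "~" sorts after "_", so items
                                                   # starting exactly now are included
    FilterExpression=Attr("endTime").gt(now),      # filter runs after the read
)
messages = resp["Items"]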
So I then thought that rather than sending messages from a table designed to store user preferences, I could process this table when an entry is made/edited and populate a toSend table, which would look like this:
timeSlot (pk)   timeSlot_messageId (sk)
00:30           00:30_Message1_Id
00:30           00:30_Message2_Id
01:00           01:00_Message1_Id
Finding the messages to send at a certain time would be nice and fast, as I'd just query on the timeSlot.
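Querying it would presumably be something like this (boto3; names illustrative):

import boto3
from boto3.dynamodb.conditions import Key

to_send = boto3.resource("dynamodb").Table("toSend")

# Everything due in the 00:30 slot comes back from a single-partition Query.
resp = to_send.query(KeyConditionExpression=Key("timeSlot").eq("00:30"))
messages = resp["Items"]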
But again I'm worried about hot spots and throughput. Is it OK for each partition to have thousands of rows and for just that partition to be read? Are there any other problems with this approach?
Another possibility would be to have different tables (rather than partitions) for each half hour when something could be sent
e.g. toSendAt_00:30, toSendAt_01:00, toSendAt_01:30
and these would have the messageId as the primary key and would contain the data needing to be sent. I'd just scan the table. Is this overkill?
Rather than do big reads of data every half an hour, would I be better duplicating the data into Elastic Search and querying this?
Thanks!

Recommended Schema for DynamoDB calendar/event like structure

I'm pretty new to DynamoDB design and trying to get the correct schema for my application. In this app different users will enter various attributes about their day. For example "User X, March 1st 12:00-2:00, Tired". There could be multiple entries for a given time, or overlapping times (e.g. tired from 12-2 and eating lunch from 12-1).
I'll need to query based on user and time ranges. Common queries:
Give me all the "actions" for user X between time t1 and t2
Give me all the start times for action Z for user X
My initial thought was that the partition key would be userid and the range key the start time, but that won't work because of duplicate start times, right?
A second thought:
UserID - Partition Key
StartTime - RangeKey
Action - JSON document of all actions for that start time
[{ action: "Lunch", endTime:"1pm"},{action:tired, endTime:"2pm"}]
Any recommendation on a proper schema?
This doesn't really have a single solution; you will need to evaluate multiple options depending on your use case: how much data you have, how often you query, which fields you query by, etc.
But one good solution is to partition your schema like this.
Generated UUID as partition key
UserID
Start time (in unix epoch time or ISO8601 time format)
Advantages
Can handle multiple time zones
Can easily query by userID and start date (you will need a secondary index with partition key userID and sort key start time)
More even distribution of your data across DynamoDB partitions and fewer hot keys, because of the randomly generated partition key.
Disadvantages
More data for every item (because of UUID) (+16 bytes)
Additional cost for the new secondary index; note that scanning the table is generally much more expensive than maintaining a secondary index.
This is pretty close to your initial thought. To give a more precise answer we would need a lot more information about how many writes and reads you are planning, and what kind of queries you will need.
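As a rough sketch of that layout (boto3; the table, index and attribute names are only examples):

import uuid
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("UserActions")  # example table name

# Write: random UUID partition key, userID and ISO8601 start time as attributes.
table.put_item(Item={
    "id": str(uuid.uuid4()),                 # partition key
    "userId": "userX",
    "startTime": "2017-03-01T12:00:00Z",
    "endTime": "2017-03-01T14:00:00Z",
    "action": "Tired",
})

# Read: "all actions for user X between t1 and t2" goes through the secondary
# index (partition key userId, sort key startTime).
resp = table.query(
    IndexName="userId-startTime-index",      # example GSI name
    KeyConditionExpression=Key("userId").eq("userX")
        & Key("startTime").between("2017-03-01T00:00:00Z",
                                   "2017-03-02T00:00:00Z"),
)
items = resp["Items"]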
You are right that UserID as partition key and StartTime as range key would be the obvious choice, if it weren't for your overlapping activities.
I would consider going for
UserID - Partition Key
StartTime + uuid - RangeKey
StartTime - Plain old attribute
Datetimes in DynamoDB are stored as strings anyway. So the idea here is that you use StartTime + some uuid as your range key, which gives you a table sortable by datetime while also ensuring you have unique primary keys. You could then store the StartTime in a separate attribute, or have a function for adding/removing the uuid from the StartTime + uuid value.
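A minimal sketch of the idea (boto3; table and attribute names are just examples):

import uuid
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("UserActions")  # example table name

start = "2017-03-01T12:00:00Z"
table.put_item(Item={
    "UserID": "userX",                          # partition key
    "StartTimeId": f"{start}#{uuid.uuid4()}",   # range key: sortable and unique
    "StartTime": start,                         # plain old attribute
    "EndTime": "2017-03-01T13:00:00Z",
    "Action": "Lunch",
})

# Range queries still work because the range key is prefixed with the datetime:
resp = table.query(
    KeyConditionExpression=Key("UserID").eq("userX")
        & Key("StartTimeId").between("2017-03-01T00:00:00Z",
                                     "2017-03-02T00:00:00Z"),
)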

DynamoDB multi tenant - partition key

I have read in a blog that I could make a DynamoDB table multi-tenant by using the tenant id as the partition key and, for example, the customer id as the sort key.
It sounds good, but imagine that I have a big workload for tenant id = X; then I am going to have a big workload on that one partition.
Is it better to create a hash key that concatenates tenantid + customerid, so that I will not have a hotspot?
Yes, you can, depending on your access pattern.
Whenever you want to Get or Query items from a DynamoDB table, you need to provide the exact partition-key. If you don't do that, you can only Scan, which is a costly operation.
If you'll mostly be interested in data at tenant-id + customer-id, then it makes sense to make that the partition key. If you won't always have the customer-id, then you should keep tenant-id as the partition key.
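A small sketch of the concatenated-key idea (boto3; table and attribute names are examples). The trade-off is that you can no longer Query all customers of a tenant with a single key condition; for that you would need a secondary index on tenant-id.

import boto3

table = boto3.resource("dynamodb").Table("TenantData")  # example table name

def pk(tenant_id, customer_id):
    return f"{tenant_id}#{customer_id}"   # concatenated partition key

# Writes and point reads know both ids, so the exact key is always available:
table.put_item(Item={"pk": pk("tenantX", "cust42"), "plan": "premium"})
item = table.get_item(Key={"pk": pk("tenantX", "cust42")}).get("Item")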

Slow making many aggregate queries to a very large SQL Server table

I have a custom log/transaction table that tracks my users' every action within the web application; it currently has millions of records and grows by the minute. In my application I need to implement some way of precalculating a user's activities/actions in SQL to determine whether other features/actions are available to the user within the application. For one example, before a page loads, I need to check if the user has viewed a page X number of times.
(SELECT COUNT(*) FROM MyLog WHERE UserID = xxx and PageID = 123)
I am making several similar aggregate queries with joins to check other conditions, and the performance is poor. These checks occur on every page request, and the application can receive hundreds of requests per minute.
I'm looking for any ideas to improve the application performance through sql and/or application code.
This is a .NET 2.0 app and using SQL Server 2008.
Much thanks in advance!
Easiest way is to store the counts in a table by themselves. Then, when adding records (hopefully through an SP), you can simply increment the affected row in your aggregate table. If you are really worried about the counts getting out of whack, you can put a trigger on the detail table to update the aggregate table; however, I don't like triggers as they have very little visibility.
Also, how up to date do these counts need to be? Can this be something that can be stored into a table once a day?
Querying a log table like this may be more trouble than it is worth.
As an alternative I would suggest using something like memcache to store the value as needed. As long as you update the cache on each hit, it will be much faster than querying a large database table. Memcache has a built-in increment operation that handles this kind of thing.
This way you only need to query the db on the first visit.
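A rough sketch of the pattern (in Python with pymemcache purely to illustrate the idea; the app in question is .NET, so treat the client library, key format, and the load_count_from_db callback as assumptions):

from pymemcache.client.base import Client

cache = Client(("localhost", 11211))

def record_page_view(user_id, page_id):
    key = f"views:{user_id}:{page_id}"
    cache.add(key, "0")          # no-op if the counter already exists
    return cache.incr(key, 1)    # atomic increment on the memcached server

def view_count(user_id, page_id, load_count_from_db):
    key = f"views:{user_id}:{page_id}"
    cached = cache.get(key)
    if cached is not None:
        return int(cached)
    count = load_count_from_db(user_id, page_id)  # only hit the DB on a cache miss
    cache.set(key, str(count))
    return count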
Another alternative is to use a precomputed table, updating it as needed.
Have you indexed MyLog on UserID and PageID? If not, that should give you some huge gains.
Todd, this is a tough one because of the number of operations you are performing.
Have you checked your indexes on that database?
Here's a stored procedure you can execute to help find missing indexes. I can't remember where I found this, but it helped me:
CREATE PROCEDURE [dbo].[SQLMissingIndexes]
@DBNAME varchar(100)=NULL
AS
BEGIN
-- SET NOCOUNT ON added to prevent extra result sets from
-- interfering with SELECT statements.
SET NOCOUNT ON;
SELECT
migs.avg_total_user_cost * (migs.avg_user_impact / 100.0)
* (migs.user_seeks + migs.user_scans) AS improvement_measure,
'CREATE INDEX [missing_index_'
+ CONVERT (varchar, mig.index_group_handle)
+ '_' + CONVERT (varchar, mid.index_handle)
+ '_' + LEFT (PARSENAME(mid.statement, 1), 32) + ']'
+ ' ON ' + mid.statement
+ ' (' + ISNULL (mid.equality_columns,'')
+ CASE WHEN mid.equality_columns IS NOT NULL
AND mid.inequality_columns IS NOT NULL THEN ',' ELSE '' END
+ ISNULL (mid.inequality_columns, '')
+ ')'
+ ISNULL (' INCLUDE (' + mid.included_columns + ')', '') AS create_index_statement,
migs.*,
mid.database_id,
mid.[object_id]
FROM
sys.dm_db_missing_index_groups mig
INNER JOIN
sys.dm_db_missing_index_group_stats migs
ON migs.group_handle = mig.index_group_handle
INNER JOIN sys.dm_db_missing_index_details mid
ON mig.index_handle = mid.index_handle
WHERE
migs.avg_total_user_cost
* (migs.avg_user_impact / 100.0)
* (migs.user_seeks + migs.user_scans) > 10
AND
(@DBNAME = db_name(mid.database_id) OR @DBNAME IS NULL)
ORDER BY
migs.avg_total_user_cost
* migs.avg_user_impact
* (migs.user_seeks + migs.user_scans) DESC
END
I modified it a bit to accept a db name. If you don't provide a db name it will run against all databases and give you suggestions on what fields need indexing.
To run it use:
exec DatabaseName.dbo.SQLMissingIndexes 'MyDatabaseName'
I usually put reusable SQL (sproc) code in a separate database called DBA, so from any database I can say:
exec DBA.dbo.SQLMissingIndexes
As an example.
Edit
Just remembered the source, Bart Duncan.
Here is a direct link http://blogs.msdn.com/b/bartd/archive/2007/07/19/are-you-using-sql-s-missing-index-dmvs.aspx
But remember I did modify it to accept a single db name.
We had the same problem, beginning several years ago: we moved from SQL Server to OLAP cubes, and when that stopped working recently we moved again, to Hadoop and some other components.
OLTP (Online Transaction Processing) databases, of which SQL Server is one, are not very good at OLAP (Online Analytical Processing). This is what OLAP cubes are for.
OLTP provides good throughput when you're writing and reading many individual rows. It fails, as you just found, when doing many aggregate queries that require scanning many rows. Since SQL Server stores every record as a contiguous block on the disk, scanning many rows means many disk fetches. The cache saves you for a while - so long as your table is small, but when you get to tables with millions of rows the problem becomes evident.
Frankly, OLAP isn't that scalable either, and at some point (tens of millions of new records per day) you're going to have to move to a more distributed solution - either paid (Vertica, Greenplum) or free (HBase, Hypertable).
If neither is an option (e.g. no time or no budget), then for now you can alleviate your pain somewhat by spending more on hardware. You need very fast IO (fast disks, RAID) and as much RAM as you can get.
