DynamoDB TTL Metric - amazon-dynamodb

I have TTL set for 60 minutes. For the past 1 month or so it was working fine, records were deleted within less than 20 mins of TTL expiration. But lately (since this week) some (not all) takes up to 3 hours to delete after TTL expired.
I understand it can take a max of 48 hours, but my customer is asking for prove or justification to the current TTL behavior. Just saying IO workload influences TTL is not enough.
What metric can I use or look at to provide concrete evidence to current TTL's behavior? Is there any benchmark, eg IO load of N will cause N hours of delay.

Unfortunately there is no publicly available way to determine the time it will take for TTL to delete your items. The 48 hours isn't guaranteed either.
You'll find anecdotal evidence online regarding the behavior of TTL under different scenarios (e.g. large tables vs small, other processing happening in your account, etc.), but no official guidance that will answer the question your client is asking.
If your client is unsatisfied the ambiguity around TTL, perhaps they should be exploring other solutions like implementing delete in your application logic.

Related

Amazon DynamoDB throttling

I've enabled auto-scaling for our dynamo-db table. It has a target utilization at 30% but it keeps throttling.
See this example screenshot where throttling is happening
As you can see it's exactly scaling up as you want it too. But I don't understand why it's still throttling. Its almost always below the provisioned throughput.
Can anyone explain what's going wrong and why it's still throttling?
Thanks,
Hendrik
Very hard to tell from the graph, and limited information.
Some thoughts:
AutoScaling can take 5 - 10 minutes to kick in. This is not fast enough if there is a sudden increase in usage. Perhaps you are seeing throttling in that 5 - 10 minute window before it scales up.
If you set CloudWatch Metrics to 1 min interval, you might see whats going on in a big more detail.
As mkobit mentioned, you might be hitting throughput on your Partition, depending on how your data is structured.
Your capacity units are evenly distributed between your partitions. So you may hit the capacity limit on your partition you have and which records you are trying to access, but not your table throughput.
This also depends on the amount of data you have stored, number of partitions etc.
HTH

DynamoDB AutoScaling still throttles

I have an application using DynamoDB and I noticed they just implemented autoscaling which is awesome. I love the concept and the timing for my app is pretty perfect. However I am still getting some issues that I wonder if I can't tweak settings to remove.
My application gets definite spikes in usage so I think this is an ideal thing to use, however with autoscaling on I still am getting some throttling. Here is my read graphs for the last 12 hours:
As you can see, when it spikes the usage is set low, so it throttles for a minute or two until the update kicks in, then works. That's ok I guess and better than not scaling, but I would like it not to throttle at all...
Is there any way to tell DynamoDB to never throttle unless it goes over 100 (or 200 or whatever I set as the top limit)? Just if it gets a surge turn up the throughput for 15 minutes or whatever until the surge is over?
Autoscaling uses CloudWatch. You can see these alarms by going to the CloudWatch dashboard and look for the alarms that include your table name and "DO NOT EDIT OR DELETE" in the description.
Why am I telling you this?
Well, the CloudWatch has some minimal period granularity. Currently it's 1 minute. It means that it will wait at least for 1 minute before firing any event to its listeners. Therefore, it will take at least one minute after the load starts until the capacity is increased. Actually it will be even more, since increasing the capacity also takes time. Bottom line: if you have very large spike, some requests may be throttled, since the the auto-scaling will not yet take effect and bursting can be exhausted.
The simple, but costly, solution will be increasing the initial capacity.
If you know about the upcoming spike in advance (e.g. you have some job running periodically, or customer peaks in certain times) you can use API to modify autoscaling programmatically.

peak read capacity units dynamo DB table

I need to find out the peak read capacity units consumed in the last 20 seconds in one of my dynamo DB table. I need to find this pro-grammatically in java and set an auto-scaling action based on the usage.
Please can you share a sample java program to find the peak read capacity units consumed in the last 20 seconds for a particular dynamo DB table?
Note: there are unusual spikes in the dynamo DB requests on the database and hence needs dynamic auto-scaling.
I've tried this:
result = DYNAMODB_CLIENT.describeTable(recomtableName);
readCapacityUnits = result.getTable()
.getProvisionedThroughput().getReadCapacityUnits();
but this gives the provisioned capacity but I need the consumed capacity in last 20 seconds.
You could use the CloudWatch API getMetricStatistics method to get a reading for the capacity metric you require. A hint for the kinds of parameters you need to set can be found here.
For that you have to use Cloudwatch.
GetMetricStatisticsRequest metricStatisticsRequest = new GetMetricStatisticsRequest()
metricStatisticsRequest.setStartTime(startDate)
metricStatisticsRequest.setEndTime(endDate)
metricStatisticsRequest.setNamespace("AWS/DynamoDB")
metricStatisticsRequest.setMetricName('ConsumedWriteCapacityUnits',)
metricStatisticsRequest.setPeriod(60)
metricStatisticsRequest.setStatistics([
'SampleCount',
'Average',
'Sum',
'Minimum',
'Maximum'
])
List<Dimension> dimensions = []
Dimension dimension = new Dimension()
dimension.setName('TableName')
dimension.setValue(dynamoTableHelperService.campaignPkToTableName(campaignPk))
dimensions << dimension
metricStatisticsRequest.setDimensions(dimensions)
client.getMetricStatistics(metricStatisticsRequest)
But I bet you'd results older than 5 minutes.
Actually current off the shelf autscaling is using Cloudwatch. This does have a drawback and for some applications is unacceptable.
When spike load is hitting your table it does not have enough capacity to respond with. Reserved with some overload is not enough and a table starts throttling. If records are kept in memory while waiting a table to respond it can simply blow the memory up. Cloudwatch on the other hand reacts in some time often when spike is gone. Based on our tests it was at least 5 mins. And rising capacity gradually, when it was needed straight up to the max
Long story short. We have created custom solution with own speedometers. What it does is counting whatever it has to count and changing tables's capacity accordingly. There is a still a delay because
App itself takes a bit of time to understand what to do
Dynamo table takes ~30 sec to get updated with new capacity details.
On a top we also have a throttling detector. So if write/read request has got throttled we immediately rise a capacity accordingly. Some times level of capacity looks all right but throttling because of HOT key issue.

Why the time-out values are so small?

I'm not sure if this is a pure stackoverflow relevant question. It is related to general design practice. Since I cannot think of another relevant stack exchange site, posting it here.
In the general design practice of converting an async call to sync one, we use a time-out and wait for the results. While, this may not exactly a good practice from the point of view of responsiveness, it definitely makes the implementation easier.
I have seen many such implementations and often noticed that the developers tend to give a very small time-out value. I can understand that the people may have the need of a responsive system in mind when they did this. But many of these applications I have seen are very data critical ones where the loss of data is very bad. So, it is always better to wait more and try to get as much data instead of timing out early and giving an error message to the user. Now, the situations where the server failing to give data or the client unable to reach server etc are rare. In those situations, I expect the a large time-out for such waits. After all, these time-outs don't mean that the wait will definitely last until the given time-out value; the timeout value is only an upper limit. So, I have always arguing for higher values here. But I see the use of low values in more and more places and now I'm getting confused if really there is something else in this practice that I don't understand.
So, my question is : Are there any arguments, other than the need for responsiveness to implement a very small time-out for waiting?
As always, the right decision depends on the real-life data.
The timeout should be proportional to the time it usually takes to complete an operation successfully.
Sending a UDP message for example could take between 1 - 50 milliseconds so a timeout of 100 milliseconds is more than reasonable however copying a file over the wire could take minutes or more so a 100 millisecond timeout is laughable.
There are pros and cons to both short and long timeouts so it's a tradeoff. Longer timeouts use more resources (tasks, threads, memory, etc.) for the same amount of work while short timeouts, as you mentioned, may result in loss of data.
In conclusion, you need to set a configurable timeout that sounds reasonable and then figure out whether you timeout too many operations in production or the other way around and calibrate accordingly.

How can I find the average number of concurrent users for IIS to simulate during a load/performance test?

I'm using JMeter for load testing. I'm going through and exercise of finding the max number of concurrent threads (users) that our webserver can handle by simply increasing the # of threads in my distributed JMeter test case, and firing off the test.
Then -- it struck me, that while the MAX number may be useful, the REAL number of users that my website actually handles on average is the number I need to make the test fruitful.
Here are a few pieces of information about our setup:
This is a mixed .NET/Classic ASP site. Upon login, a browser session (with timeout) is created in both for the users.
Each session times out after 60 minutes.
Is there a way using this information, IIS logs, performance counters, and/or some calculation that will help me determine the average # of concurrent users we handle on our production site?
You might use logparser with the QUANTIZE function to determine the peak number of requests over a suitable interval.
For a 10 second window, it would be something like:
logparser "select quantize(to_localtime(to_timestamp(date,time)), 10) as Qnt,
count(*) as Hits from yourLogFile.log group by Qnt order by Hits desc"
The reported counts won't be exactly the same as threads or users, but they should help get you pointed in the right direction.
The best way to do exact counts is probably with performance counters, but I'm not sure any of the standard ones works like you would want -- you'd probably need to create a custom counter.
I can see a couple options here.
Use Performance Monitor to get the current numbers or have it log all day and get an average. ASP.NET has a Requests Current counter. According to this page Classic ASP also has a Requests current, but I've never used it myself.
Run the IIS logs through Log Parser to get the total number of requests and how long each took. I'm thinking that if you know how many requests come in each hour and how long each took, you can get an average of how many were running concurrently.
Also, keep in mind that concurrent users isn't quite the same as concurrent threads on the server. For one, multiple threads will be active per user while content like images is being downloaded. And after that the user will be on the page for a few minutes while the server is idle.
My suggestion is that you define the stop conditions first, such as
Maximum CPU utilization
Maximum memory usage
Maximum response time for requests
Other key parameters you like
It is really subjective to choose the parameters and I personally cannot provide much experience on that.
Secondly you can see whether performance counters or IIS logs can map to the parameters. Then you set up proper mappings.
Thirdly you can start testing by simulating N users (threads) and see whether the stop conditions hit. If not hit, you can go to a higher number. If hit, you can use a smaller number. Recursively you will find a rough number.
However, that never means your web site in real world can take so many users. No simulation so far can cover all the edge cases.

Resources