I'm running a simple api that gets an item from a dynamodb table on each call, I have auto scaling set to a minimum of 25 and a maximum of 10 000.
However if I send 15 000 requests with a tool like wrk or hey, I get about 1000 502s,
dynamodb's metrics show that reads are throttled
the scaling activities log on the table shows that the RCUs were scaled to 99 but not more than that
lambda logs show that the function starts to take longer, it usually takes about 20ms to run, but the function starts to run for 500.1500,3000 ms and start timing out (I'm assuming that's caused by the throttling)
Why isn't the autoscaling working better? It only scales upto 99RCUs but my max is 10, 000.
We ran into the same problem when testing DynamoDB autoscaling for short amounts of time, and it turns out the problem is that the scaling events only happen after 5 minutes of elevated throughput (you can see this by inspecting the CloudWatch alarms the autoscaling sets up)
This excellent blog post helped us solve this by creating a Lambda that responds to the CloudWatch API events and improves the responsiveness of the alarms to one minute: https://hackernoon.com/the-problems-with-dynamodb-auto-scaling-and-how-it-might-be-improved-a92029c8c10b
from: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/AutoScaling.html
What you defined as "target utilization"?
Target utilization is the ratio of consumed capacity units to provisioned capacity units, expressed as a percentage. Application Auto Scaling uses its target tracking algorithm to ensure that the provisioned read capacity of ProductCatalog is adjusted as required so that utilization remains at or near 70 percent.
also, i think that the main reason that autoscale not works for you, is because your work might not stay elevated for a long time:
"DynamoDB auto scaling modifies provisioned throughput settings only when the actual workload stays elevated (or depressed) for a sustained period of several minutes"
DynamoDB auto scaling modifies provisioned throughput settings only when the actual workload stays elevated (or depressed) for a sustained period of several minutes. The Application Auto Scaling target tracking algorithm seeks to keep the target utilization at or near your chosen value over the long term.
Sudden, short-duration spikes of activity are accommodated by the table's built-in burst capacity. For more information, see Use Burst Capacity Sparingly.
Related
I need to feed the db with something like 10k items, I don't need to rush and can/want stay below the 25wcu free plan.
It will take something like 6.5minutes (10000requests/25requests/sec).
Here is the question. I will loop on a json to feed the base do I have to handle the number of request by second myself or can I push to the max and it will be queued ?
I read that I may just have an error message (400?) when I exceed the limit can I just brutally retry the failed requests (eventually making more fail) until my 10k items are put in the db ?
tldr; => best way/strategy to feed the base knowing there is a limit of calls/sec
ps: it's run from a lambda idk if it matters.
Your calculation is a little bit off unless you do this constantly, because AWS actually allows you to burst a little bit (docs):
DynamoDB provides some flexibility in your per-partition throughput
provisioning by providing burst capacity. Whenever you're not fully
using a partition's throughput, DynamoDB reserves a portion of that
unused capacity for later bursts of throughput to handle usage spikes.
DynamoDB currently retains up to 5 minutes (300 seconds) of unused
read and write capacity. During an occasional burst of read or write
activity, these extra capacity units can be consumed quickly—even
faster than the per-second provisioned throughput capacity that you've
defined for your table.
Since 300 (seconds) * 25 (WCU) = 7500 this leaves you with about 7.5k items until it will actually throttle.
Afterwards just responding to the ProvisionedThroughputExceeded error by retrying later is fine - but make sure to add a small delay between retries (e.g. 1 second) as you know that it takes time for the new WCU to flow into the tocken bucket. Immediately retrying and hammering the API is not a kind way to use the service and might look like a DoS attack.
You can also write the items in Batches to reduce the amount of network requests as network throughput is also limited in Lambda. This makes handling the ProvisionedThroughputExceeded error slightly more cumbersome, because you need to inspect the response which items failed to write, but will probably be a net positive.
I have an application using DynamoDB and I noticed they just implemented autoscaling which is awesome. I love the concept and the timing for my app is pretty perfect. However I am still getting some issues that I wonder if I can't tweak settings to remove.
My application gets definite spikes in usage so I think this is an ideal thing to use, however with autoscaling on I still am getting some throttling. Here is my read graphs for the last 12 hours:
As you can see, when it spikes the usage is set low, so it throttles for a minute or two until the update kicks in, then works. That's ok I guess and better than not scaling, but I would like it not to throttle at all...
Is there any way to tell DynamoDB to never throttle unless it goes over 100 (or 200 or whatever I set as the top limit)? Just if it gets a surge turn up the throughput for 15 minutes or whatever until the surge is over?
Autoscaling uses CloudWatch. You can see these alarms by going to the CloudWatch dashboard and look for the alarms that include your table name and "DO NOT EDIT OR DELETE" in the description.
Why am I telling you this?
Well, the CloudWatch has some minimal period granularity. Currently it's 1 minute. It means that it will wait at least for 1 minute before firing any event to its listeners. Therefore, it will take at least one minute after the load starts until the capacity is increased. Actually it will be even more, since increasing the capacity also takes time. Bottom line: if you have very large spike, some requests may be throttled, since the the auto-scaling will not yet take effect and bursting can be exhausted.
The simple, but costly, solution will be increasing the initial capacity.
If you know about the upcoming spike in advance (e.g. you have some job running periodically, or customer peaks in certain times) you can use API to modify autoscaling programmatically.
I have a requirement where I need to initialise my dynamodb table with large volumne of data. Say around 1M in 15 min so I ll have to provision WCU to 10k but after that my load is ~1k per second so I ll decrease WCU to 1k from 10k . Is there any performance drawback or issues in decreasing WCU.
Thanks
In general, assuming the write request doesn't exceed the write capacity units (i.e. as you have not mentioned the item size), there should not be any performance issue.
If at any point you anticipate traffic growth that may exceed your
provisioned throughput, you can simply update your provisioned
throughput values via the AWS Management Console or Amazon DynamoDB
APIs. You can also reduce the provisioned throughput value for a table
as demand decreases. Amazon DynamoDB will remain available while
scaling it throughput level up or down.
Consider this scenario:-
Assume the item size is 1.5KB in size.
First, you would determine the number of write capacity units required per item, rounding up to the nearest whole number, as shown following:
1.5 KB / 1 KB = 1.5 --> 2
The result is two write capacity units per item. Now, you multiply this by the number of writes per second (i.e. 1K per second).
2 write capacity units per item × 1K writes per second = 2K write capacity units
In this scenario, the DynamoDB would throw error code 400 on your extra requests.
If your application performs more reads/second or writes/second than
your table’s provisioned throughput capacity allows, requests above
your provisioned capacity will be throttled and you will receive 400
error codes. For instance, if you had asked for 1,000 write capacity
units and try to do 1,500 writes/second of 1 KB items, DynamoDB will
only allow 1,000 writes/second to go through and you will receive
error code 400 on your extra requests. You should use CloudWatch to
monitor your request rate to ensure that you always have enough
provisioned throughput to achieve the request rate that you need.
Yes, there is a potential impact.
Once you write at high TPS, the more no. of partitions gets created which cannot be reduced later on.
If this number was higher than what is needed eventually for application to run well, this can cause problems.
Read more about DDB partitions for same.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.Partitions.html
I need to find out the peak read capacity units consumed in the last 20 seconds in one of my dynamo DB table. I need to find this pro-grammatically in java and set an auto-scaling action based on the usage.
Please can you share a sample java program to find the peak read capacity units consumed in the last 20 seconds for a particular dynamo DB table?
Note: there are unusual spikes in the dynamo DB requests on the database and hence needs dynamic auto-scaling.
I've tried this:
result = DYNAMODB_CLIENT.describeTable(recomtableName);
readCapacityUnits = result.getTable()
.getProvisionedThroughput().getReadCapacityUnits();
but this gives the provisioned capacity but I need the consumed capacity in last 20 seconds.
You could use the CloudWatch API getMetricStatistics method to get a reading for the capacity metric you require. A hint for the kinds of parameters you need to set can be found here.
For that you have to use Cloudwatch.
GetMetricStatisticsRequest metricStatisticsRequest = new GetMetricStatisticsRequest()
metricStatisticsRequest.setStartTime(startDate)
metricStatisticsRequest.setEndTime(endDate)
metricStatisticsRequest.setNamespace("AWS/DynamoDB")
metricStatisticsRequest.setMetricName('ConsumedWriteCapacityUnits',)
metricStatisticsRequest.setPeriod(60)
metricStatisticsRequest.setStatistics([
'SampleCount',
'Average',
'Sum',
'Minimum',
'Maximum'
])
List<Dimension> dimensions = []
Dimension dimension = new Dimension()
dimension.setName('TableName')
dimension.setValue(dynamoTableHelperService.campaignPkToTableName(campaignPk))
dimensions << dimension
metricStatisticsRequest.setDimensions(dimensions)
client.getMetricStatistics(metricStatisticsRequest)
But I bet you'd results older than 5 minutes.
Actually current off the shelf autscaling is using Cloudwatch. This does have a drawback and for some applications is unacceptable.
When spike load is hitting your table it does not have enough capacity to respond with. Reserved with some overload is not enough and a table starts throttling. If records are kept in memory while waiting a table to respond it can simply blow the memory up. Cloudwatch on the other hand reacts in some time often when spike is gone. Based on our tests it was at least 5 mins. And rising capacity gradually, when it was needed straight up to the max
Long story short. We have created custom solution with own speedometers. What it does is counting whatever it has to count and changing tables's capacity accordingly. There is a still a delay because
App itself takes a bit of time to understand what to do
Dynamo table takes ~30 sec to get updated with new capacity details.
On a top we also have a throttling detector. So if write/read request has got throttled we immediately rise a capacity accordingly. Some times level of capacity looks all right but throttling because of HOT key issue.
I am creating an online crowd driven game. I expect the read/write requests to fluctuate (like, 50,50,50,1500,50,50,50)every second and I need to process all 100% requests with strong consistency.
I am planning to go with AWS's DynamoDB from GAE datastore for its strong consistency. I have the below doubts which I could not get clear answers in other discussions.
1. If the item size for a write action is just 4B, Will that be rounded to a 1KB and consume a write unit?
2. Financially it is not wise to set the Provisioned Throughput Capacity around the expected peak value. Alarms can warn us. But in the case of sudden rise, the requests could be throttled at the time we receive alarm. Is DynamoDB really designed to handle highly fluctuating read/write?
3. I read about Dynamc DynamoDB to update the read/write throughput capacity for us, When we add some read/write units, How long it will take to allocate them? If it takes too long, Whats the use of increasing the bar after the tide hits?
Google app engine bills just for the number of requests happen in that month. If I can make AWS work like, "Whatever the request count could be, I will expand and contract myself and charge you only for the used read/write units", I will go for AWS.
Please advise. Dont hesitate if I am not being clear at parts.
Thanks,
Karthick.
Yes. Item sizes are rounded up and the throughput is used. From the Provisioned Throughput in Amazon DynamoDB documentation:
The total number of read operations necessary is the item size, rounded up to the next multiple of 4 KB, divided by 4 KB.
It can handle some bursting, but it is generally intended to be used for uniform workloads. Here is a section from the Guidelines for Working with Tables documentation and some other helpful links about the best practices:
A temporary non-uniformity in a workload can generally be absorbed by the bursting allowance, as described in Use Burst Capacity Sparingly. However, if your application must accommodate non-uniform workloads on a regular basis, you should design your table with DynamoDB's partitioning behavior in mind (see Understand Partition Behavior), and be mindful when increasing and decreasing provisioned throughput on that table.
Query and Scan guidelines for avoiding bursts of read activity
The Table Best Practices section
Use Burst Capacity Sparingly
This one is going to depend on how much data your table has, because DynamoDB will have to repartition the data if you are scaling up. See the Consider Workload Uniformity When Adjusting Provisioned Throughput documentation for more information about the partitioning..