Interpreting output of .show capacity command - azure-data-explorer

I am trying to make sense of the output returned by the .show capacity command. I am finding that different clusters I have access to have the same capacity policy, yet when I run .show capacity I see different numbers in the 'Total' column of the result set. Isn't the total determined by the capacity policy?
Also, what does it mean when, for example, the remaining capacity for the 'DataExport' resource is 30? Does it mean that 30 more export commands can be accommodated (each with its own OperationId) without getting queued, assuming they queue up at all when more export commands are issued than there are 'Remaining' slots?

Based on the scope property, the output of .show capacity may depend not only on the cluster's capacity policy, but also on the cluster's workload groups and their request rate limit policies.
Even if both are left unaltered, their defaults differ between clusters that have different SKUs (different numbers of cores per node) or a different number of nodes (different total number of cores).
A remaining capacity of 30 means that up to 30 export commands may run concurrently; the 31st will be throttled.
It doesn't necessarily mean the cluster can't physically handle more than 30; rather, it's a (configurable) threshold at which requests of this type are throttled, to keep the export workload from consuming too many resources.
There's no queuing of such requests (unless your workload group definition and its request queuing policy specify that queuing is enabled for these kinds of requests).
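For reference, here is a minimal Python sketch of pulling the state the answer above refers to: the effective capacities, the cluster's capacity policy, and the workload groups. It assumes the azure-kusto-data package and Azure CLI authentication; the cluster URI and database name are placeholders.

    from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

    # Placeholder cluster URI; any authentication method supported by the SDK works.
    kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(
        "https://<yourcluster>.<region>.kusto.windows.net"
    )
    client = KustoClient(kcsb)

    # Management commands are sent with execute_mgmt; the database name is not
    # significant for cluster-level commands, so any accessible database works.
    for command in (".show capacity", ".show cluster policy capacity", ".show workload_groups"):
        response = client.execute_mgmt("MyDatabase", command)
        print(command)
        for row in response.primary_results[0]:
            print(row.to_dict())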

Related

Kusto ingestion limit and command throttle because of capacity policy

I use the Kusto ingest client kustoClient.IngestFromDataReader to ingest data, and it throws the exception: An error occurred for source: 'DataReader'. Error: 'Failed to ingest: State='Throttled', Status='The control command was aborted due to throttling. Retrying after some backoff might succeed. CommandType: 'DataIngestPull', Capacity: 18, Origin: 'CapacityPolicy/Ingestion'.'. I read the document here https://learn.microsoft.com/en-us/azure/data-explorer/kusto/management/capacitypolicy#ingestion-capacity, and I guess it may be because there are too many requests running concurrently and the cluster capacity is limited. Am I right?
I am still a bit confused by the document. What does the final number (Minimum(ClusterMaximumConcurrentOperations, Number of nodes in cluster * Maximum(1, Core count per node * CoreUtilizationCoefficient))) mean? Does it mean the total number of concurrent operations? And specifically, does one Kusto ingest client or one Kusto ingest command count as only one concurrent operation, or is that configurable?
Thanks a lot!
Effectively, the document means that the ingestion capacity (in terms of concurrent ingestion operations) is 3/4 of the overall number of cores in the cluster, but not higher than 512.
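As a rough illustration (not from the documentation itself), here is a small Python sketch that evaluates that formula with the 0.75 coefficient and 512 cap mentioned above; the node and core counts are made-up inputs, and the integer rounding is an assumption.

    def ingestion_capacity(nodes, cores_per_node,
                           core_utilization_coefficient=0.75,
                           cluster_maximum_concurrent_operations=512):
        # Minimum(ClusterMaximumConcurrentOperations,
        #         nodes * Maximum(1, cores_per_node * CoreUtilizationCoefficient))
        per_node = max(1, int(cores_per_node * core_utilization_coefficient))
        return min(cluster_maximum_concurrent_operations, nodes * per_node)

    # e.g. a 2-node cluster with 16 cores per node: 2 * max(1, 12) = 24
    print(ingestion_capacity(2, 16))     # 24
    # a 50-node cluster with 16 cores per node hits the 512 cap
    print(ingestion_capacity(50, 16))    # min(512, 600) = 512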
You can view your cluster's capacity and its current utilization by running the '.show capacity' command.
If you do not want to handle the throttling yourself, you should use the KustoQueuedIngestClient class and pass it the ingestion service endpoint (https://ingest-..kusto.windows.net).
The ingestion service will take care of managing the load on your cluster.
See the Ingestion Overview article for more details.
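For comparison, here is a minimal sketch of the queued-ingestion route in Python, assuming the azure-kusto-ingest package (QueuedIngestClient is roughly the Python counterpart of the .NET KustoQueuedIngestClient mentioned above); the cluster URI, database, table, and file name are placeholders.

    from azure.kusto.data import KustoConnectionStringBuilder
    from azure.kusto.ingest import IngestionProperties, QueuedIngestClient

    # The "ingest-" endpoint routes requests through the ingestion service,
    # which queues and paces them so the cluster's ingestion capacity is not exceeded.
    kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(
        "https://ingest-<yourcluster>.<region>.kusto.windows.net"  # placeholder URI
    )
    client = QueuedIngestClient(kcsb)

    # Placeholder database/table; the data format defaults to CSV in this sketch.
    props = IngestionProperties(database="MyDatabase", table="MyTable")
    client.ingest_from_file("data.csv", ingestion_properties=props)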

ADX request throttling improvements

I am getting {"code": "Too many requests", "message": "Request is denied due to throttling."} from ADX when I run some batch ADF pipelines. I came across this document on workload groups. I have a cluster where we did not configure workload groups, so I assume all queries are managed by the default workload group. I found that its MaxConcurrentRequests property is 20. I have the following doubts.
Does it mean that this is the maximum concurrent requests my cluster can handle?
If I create a rest API which provides data from ADX will it support only 20 requests at a given time?
How to find the maximum concurrent requests an ADX cluster can handle?
To understand why your command is throttled, the key element in the error message is this: Capacity: 6, Origin: 'CapacityPolicy/Ingestion'.
This means the number of concurrent ingestion operations your cluster can run is 6. That number is calculated based on the cluster's ingestion capacity, which is part of the cluster's capacity policy.
It is impacted by the total number of cores/nodes the cluster has. Generally, you could:
scale up/out in order to reach greater capacity, and/or
reduce the parallelism of your ingestion commands, so that only up to 6 are being run concurrently, and/or
add logic to the client application to retry on such throttling errors, after some backoff (a minimal retry sketch follows below).
Additional reference: Control commands throttling.
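If you go with the last option, a generic backoff wrapper is enough. This Python sketch assumes the azure-kusto-data package surfaces throttled control commands as KustoServiceError, and the string check for the throttling signature is a simplification of the error shown in the question.

    import random
    import time

    from azure.kusto.data.exceptions import KustoServiceError  # assumed exception type

    def run_with_backoff(execute_command, max_attempts=5):
        """Retry a throttled control command with exponential backoff plus jitter."""
        for attempt in range(max_attempts):
            try:
                return execute_command()
            except KustoServiceError as err:
                # Simplified throttling check based on the error text shown above.
                if "throttl" not in str(err).lower():
                    raise
                time.sleep(min(2 ** attempt + random.random(), 30))
        raise RuntimeError("command still throttled after all retries")

    # Example usage with a management client (placeholder command and database):
    # run_with_backoff(lambda: client.execute_mgmt("MyDatabase", ".show capacity"))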

Why does Cosmos DB return 429 for a portion of requests despite not exceeding my manual set throughput

My Cosmos DB is using Shared Throughput across several containers. I have manually scaled up my Cosmos DB to 70,000 RU/s and I am currently running a large number of requests.
Looking in Azure, I can see that a portion of my requests are being throttled (returning 429).
To give an idea of the numbers: around 25k requests return 200 and around 5k requests return 429.
When I follow the warning in the Azure portal that says my collection is exceeding provisioned throughput, it shows the average throughput is 6.78k RU/s.
I don't understand why when I have 70,000 RU/s that my requests are being throttled when the average throughput is supposedly only 6,780 RU/s.
No other containers are being read or written to, all these requests are made against just one container.
As all these requests run a stored procedure, they all have a partition key supplied.
The most likely reason is that you have a hot partition that is reaching its allocated throughput before the other partitions do.
For a horizontally scalable database, throughput is allocated across physical partitions (computers), and data is partitioned using a partition key that basically acts as an address, routing each item to a specific computer to be stored.
Assume I have a collection with three partitions 1, 2, 3 and 30K RU/s. Each one of those will get 10K RU/s allocated to it. If I then run an operation that does a ton of operations on partition 2 and consumes all of its 10K, I'm going to get rate limited (429) even if I don't touch partition 1 or 3.
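To put numbers on that example, here is a toy Python sketch (the per-partition consumption figures are invented) showing how one partition can be rate limited while total consumption stays far below the provisioned 30K RU/s.

    provisioned_rus = 30_000
    partitions = ["p1", "p2", "p3"]
    per_partition_budget = provisioned_rus / len(partitions)   # 10,000 RU/s each

    consumed = {"p1": 500, "p2": 12_000, "p3": 300}            # invented load, hot p2
    for partition, rus in consumed.items():
        status = "429 (rate limited)" if rus > per_partition_budget else "200"
        print(partition, status)

    # Total consumption is 12,800 RU/s, well under 30,000, yet p2 is throttled
    # because it exceeded its own 10,000 RU/s share.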
To avoid this you need to pick a partition key that BOTH distributes data as evenly as possible during writes and ideally can also be used to answer queries within one or a small (bounded) number of partitions, avoiding "fan-out" queries that have to hit every partition.
Now for small collections that only reside on a single physical partition none of this matters because your data is all on a single physical partition. However, as the collection grows larger this causes issues which will prevent the database from scaling fully.
You can learn more here

DynamoDB Update/Put throttled despite high provisioned capacity

I am seeing some throttles on my updates to a DynamoDB table. I know that throttling works on a per-second basis, and that peaks above provisioned capacity can sometimes be absorbed, but that this is not guaranteed. I know that one is supposed to evenly distribute the load, which I have not done.
BUT please look at the 1-minute average graphs from the metrics (attached). The utilized capacity is way below the provisioned capacity. Where are these throttles coming from? Is it because all writes went to a particular shard?
There are no batch writes. The workload distribution is something I cannot easily control.
DynamoDB is built on the assumption that to get the full potential out of your provisioned throughput your reads and writes must be uniformly distributed over space (hash/range keys) and time (not all coming in at the exact same second).
Based on the allocated throughput shown in your graphs, you are most likely still on one shard, but it is possible that there are two or more shards if you previously raised the throughput above the current level and then lowered it to where it is now. While this is something to be mindful of, it is likely not what is causing this throttling behavior directly. If you have a lot of data in your table (over 10 GB), then you will definitely have multiple shards. That would mean you likely have a lot of cold data in your table, which may be causing this issue, but that seems less likely.
The most likely issue is that you have some hot keys. Specifically, you have one or just a few records that are receiving a very high number of read or write requests and this is resulting in throttling. Essentially DynamoDB can support massive IOPS for both writes and reads, but you can't apply all of those IOPS to just a few records, they need to be distributed among all of the records uniformly in an ideal situation.
Since the number of throttles you were showing is on the order of 10s to 100s, it may not be something to worry about. As long as you are using the official AWS SDK, it will automatically take care of retries with exponential backoff, retrying requests several times before completely giving up.
While it is difficult in many circumstances to control the distribution of reads and writes to a table, it may be worth taking another look at your hash/range key design to make sure it is really optimal for your pattern of reads and writes. Also, for reads you may employ caching through Memcached or Redis, even if the cache expires in just a few minutes or a few seconds, to help reduce the impact of hot keys. For writes you would need to look at the logic in the application to make sure there are not any unnecessary writes being performed that could be causing this issue.
One last point related to batch writes: A batch operation in DynamoDB does not reduce the consumed amount of read or writes the different child requests consume, it simply reduces the overhead of making multiple HTTP requests. While batch requests generally help with throughput, they are not useful at reducing the likelihood of throttling in DynamoDB.
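If you would rather lean on the SDK retries mentioned above instead of hand-rolling them, the retry behaviour can be tuned. This is a minimal boto3 sketch; the region, table name, and key attribute names are placeholders, and "adaptive" retry mode is one possible choice.

    import boto3
    from botocore.config import Config

    # Let the SDK retry throttled requests with exponential backoff;
    # "adaptive" mode also adds client-side rate limiting.
    retry_config = Config(retries={"max_attempts": 10, "mode": "adaptive"})
    dynamodb = boto3.resource("dynamodb", region_name="us-east-1", config=retry_config)
    table = dynamodb.Table("MyTable")   # placeholder table name

    # Placeholder item; the attribute names must match the table's key schema.
    table.put_item(Item={"pk": "user-123", "sk": "2024-01-01", "payload": "example"})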

How DynamoDB provisions throughput of reads independently of writes

Amazon DynamoDB allows the customer to provision the throughput of reads and writes independently. I have read the Amazon Dynamo paper about the system that preceded DynamoDB and read about how Cassandra and Riak implemented these ideas.
I understand how it is possible to increase the throughput of these systems by adding nodes to the cluster which then divides the hash keyspace of tables across more nodes, thereby allowing greater throughput as long as access is relatively random across hash keys. But in systems like Cassandra and Riak this adds throughput to both reads and writes at the same time.
How is DynamoDB architected differently such that it is able to scale reads and writes independently? Or is it not, and Amazon is just charging for them independently even though it essentially has to allocate enough nodes to cover the greater of the two?
You are correct that adding nodes to a cluster should increase the amount of available throughput but that would be on a cluster basis, not a table basis. The DynamoDB cluster is a shared resource across many tables across many accounts. It's like an EC2 node: you are paying for a virtual machine but that virtual machine is hosted on a real machine that is shared among several EC2 virtual machines and depending on the instance type, you get a certain amount of memory, CPU, network IO, etc.
What you are paying for when you pay for throughput is IO, and reads and writes can be throttled independently. Paying for more throughput does not cause Amazon to partition your table across more nodes. The only thing that causes a table to be partitioned more is if the size of your table grows to the point where more partitions are needed to store its data. The maximum size of a partition, from what I have gathered talking to DynamoDB engineers, is based on the size of the SSDs of the nodes in the cluster.
The trick with provisioned throughput is that it is divided among the partitions. So if you have a hot partition, you could get throttling and ProvisionedThroughputExceededExceptions even if your total requests aren't exceeding the total read or write throughput. This is contrary to what your question asks. You would expect that if your table is divided among more partitions/nodes, you'd get more throughput, but in reality it is the opposite unless you scale your throughput with the size of your table.
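A tiny Python sketch of that last point (the capacity and partition counts are invented): with a fixed provisioned throughput, each additional partition shrinks the share available to any single hot key range.

    provisioned_wcu = 1_000                  # invented table-level write capacity
    for partition_count in (1, 2, 4, 8):
        per_partition = provisioned_wcu / partition_count
        print(f"{partition_count} partition(s): {per_partition:.0f} WCU per partition")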
