Kusto ingestion limit and command throttle because of capacity policy - azure-data-explorer

I use the Kusto ingest client's kustoClient.IngestFromDataReader to ingest data, and it throws the exception: An error occurred for source: 'DataReader'. Error: 'Failed to ingest: State='Throttled', Status='The control command was aborted due to throttling. Retrying after some backoff might succeed. CommandType: 'DataIngestPull', Capacity: 18, Origin: 'CapacityPolicy/Ingestion'.'. I read the document at https://learn.microsoft.com/en-us/azure/data-explorer/kusto/management/capacitypolicy#ingestion-capacity and guess this may be because too many requests run concurrently and the cluster capacity is limited. Am I right?
I am still a bit confused by the document. What does the final number (Minimum(ClusterMaximumConcurrentOperations, Number of nodes in cluster * Maximum(1, Core count per node * CoreUtilizationCoefficient))) mean? Does it mean the total number of concurrent operations? And specifically, does one Kusto ingest client or one ingest command count as only one concurrent operation, or is that configurable?
Thanks a lot!

Effectively, the document means that ingestion capacity (in terms of concurrent ingestion operations) is 3/4 of the total number of cores in the cluster, but not higher than 512.
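For example (hypothetical numbers), for a cluster of 2 nodes with 16 cores each and the defaults mentioned above (coefficient 0.75, cap 512), the formula from the question works out like this:

nodes = 2                                      # hypothetical cluster size
cores_per_node = 16                            # hypothetical SKU
core_utilization_coefficient = 0.75            # default in the capacity policy
cluster_maximum_concurrent_operations = 512    # default cap

per_cluster = nodes * max(1, cores_per_node * core_utilization_coefficient)
ingestion_capacity = min(cluster_maximum_concurrent_operations, per_cluster)
print(ingestion_capacity)  # 24 concurrent ingestion operations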
You can view your cluster's capacity and its utilization by running the '.show cluster capacity' command.
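For example, a minimal sketch of running that command from Python with the azure-kusto-data package (the cluster URL and database name are placeholders):

import time
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

# Placeholder cluster URL; Azure CLI authentication is one of several options.
kcsb = KustoConnectionStringBuilder.with_az_cli_authentication("https://mycluster.westus.kusto.windows.net")
client = KustoClient(kcsb)

# '.show cluster capacity' is a management (control) command, so use execute_mgmt.
response = client.execute_mgmt("MyDatabase", ".show cluster capacity")
for row in response.primary_results[0]:
    print(row)  # each row shows a resource with its total, consumed and remaining capacity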
If you do not want to handle the throttling by yourself, you should use the KustoQueuedIngestClient class, and pass to it the ingestion service endpoint (https://ingest-..kusto.windows.net).
The ingestion service will take care of managing the load on your cluster.
See the Ingestion Overview article for more details.
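If you go the queued-ingestion route, a minimal sketch with the Python azure-kusto-ingest package might look like this (the ingestion endpoint, database, table and file path are placeholders; KustoQueuedIngestClient is the equivalent class in the .NET SDK):

from azure.kusto.data import KustoConnectionStringBuilder
from azure.kusto.data.data_format import DataFormat
from azure.kusto.ingest import QueuedIngestClient, IngestionProperties

# Note the ingest- prefix: this is the ingestion service endpoint, not the engine endpoint.
kcsb = KustoConnectionStringBuilder.with_az_cli_authentication("https://ingest-mycluster.westus.kusto.windows.net")
client = QueuedIngestClient(kcsb)

props = IngestionProperties(database="MyDatabase", table="MyTable", data_format=DataFormat.CSV)

# The request is queued; the ingestion service schedules it against the cluster's capacity.
client.ingest_from_file("data.csv", ingestion_properties=props)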

Related

When CosmosDb says 10 ms write latency, does it include the time it takes for the write request to reach the cosmosdb server?

This document on Cosmos DB consistency levels and latency says Cosmos DB writes have a latency of 10 ms at the 99th percentile. Does this include the time it takes for the write to reach Cosmos DB? I suspect not, since if I issue a request far away from my configured Azure regions, I don't see how it can take < 10 ms.
The SLA is for the latency involved in performing the operations and returning results. As you mention, it does not include time taken to reach the Cosmos endpoint, which depends on the client's distance.
As indicated in performance guidance:
You can get the lowest possible latency by ensuring that the calling application is located within the same Azure region as the provisioned Azure Cosmos DB endpoint.
In my experience latency <10ms is typical for an app located in the same region as the Cosmos endpoint it works against.
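One way to see the gap between the SLA number and what your client observes is to time a point read yourself, once from your laptop and once from a VM in the same region as the account. A rough sketch with the Python azure-cosmos SDK (endpoint, key, database, container, item id and partition key are placeholders):

import time
from azure.cosmos import CosmosClient

client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("mydb").get_container_client("mycontainer")

# First call includes connection/TLS setup; later calls reflect steady-state latency.
for i in range(5):
    start = time.perf_counter()
    container.read_item(item="item-id", partition_key="pk-value")
    print(f"read {i}: {(time.perf_counter() - start) * 1000:.1f} ms")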

ADX request throttling improvements

I am getting {"code": "Too many requests", "message": "Request is denied due to throttling."} from ADX when I run some batch ADF pipelines. I have came across this document on workload group. I have a cluster where we did not configured work load groups. Now i assume all the queries will be managed by default workload group. I found that MaxConcurrentRequests property is 20. I have following doubts.
Does it mean that this is the maximum concurrent requests my cluster can handle?
If I create a rest API which provides data from ADX will it support only 20 requests at a given time?
How to find the maximum concurrent requests an ADX cluster can handle?
To understand why your command is throttled, the key element in the error message is this: Capacity: 6, Origin: 'CapacityPolicy/Ingestion'.
This means the number of concurrent ingestion operations your cluster can run is 6. This is calculated from the cluster's ingestion capacity, which is part of the cluster's capacity policy.
It is determined by the total number of cores/nodes the cluster has. Generally, you could:
scale up/out in order to reach greater capacity, and/or
reduce the parallelism of your ingestion commands, so that only up to 6 are run concurrently, and/or
add logic to the client application to retry on such throttling errors, after some backoff (see the sketch below).
Additional reference: Control commands throttling
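For the retry-with-backoff option, a minimal Python sketch (how the throttling error is detected here is an assumption; adapt it to the SDK and exception types you actually use):

import time
# assumes a KustoClient named `client` has already been created (see the earlier sketch)

def run_with_backoff(database, command, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return client.execute_mgmt(database, command)
        except Exception as e:  # the service error message contains 'Throttled' / 'Too many requests'
            if "Throttled" not in str(e) and "Too many requests" not in str(e):
                raise
            time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s, ...
    raise RuntimeError("command still throttled after retries")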

DynamoDB: High SuccessfulRequestLatency

We had a period of latency in our application that was directly correlated with latency in DynamoDB and we are trying to figure out what caused that latency.
During that time, the consumed reads and consumed writes for the table were normal (much below the provisioned capacity) and the number of throttled requests was also 0 or 1. The only thing that increased was the SuccessfulRequestLatency.
The high latency occurred during a period where we were doing a lot of automatic writes. In our use case, writing to dynamo also includes some reading (to get any existing records). However, we often write the same quantity of data in the same period of time without causing any increased latency.
Is there any way to understand what contributes to an increase in SuccessfulRequestLatency when it seems that we have provisioned enough read capacity? Is there any way to diagnose the latency caused by this set of writes to DynamoDB?
You can dig deeper by checking the get and put latency in CloudWatch.
As you have already mentioned, there was no throttling, your writes involve some reading as well, and the same volume of writes at other times doesn't cause increased latency, so you should check what exactly in the read operations is causing this.
Check the SuccessfulRequestLatency metric while including the Operation dimension as well. Start with GetItem and BatchGetItem; if that doesn't help, include Scan and Query as well. For example:
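A sketch of that query with boto3 (table name and time window are placeholders):

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")

# SuccessfulRequestLatency broken down by operation; repeat for BatchGetItem, Query, Scan, etc.
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/DynamoDB",
    MetricName="SuccessfulRequestLatency",
    Dimensions=[
        {"Name": "TableName", "Value": "my-table"},
        {"Name": "Operation", "Value": "GetItem"},
    ],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average", "Maximum"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"], point["Maximum"])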
High request latency can sometimes happen when DynamoDB is doing an internal failover of one of its storage nodes.
Internally within Dynamo each storage partition has to be replicated across multiple nodes to provide a high level of fault tolerance. Occasionally one of those nodes will fail and a replacement node has to be introduced, and this can result in elevated latency for a subset of affected requests.
The advice I've had from AWS is to use a short timeout and a fast retry (e.g. 100ms) if your use-case is latency-sensitive. It's my understanding that only requests that hit the affected node experience increased latency, so within one or two retries you'll hit a different node and get a successful response, with minimal impact on your overall latency. Obviously it's hard to verify this, because it's not a scenario you can reproduce!
If you've got a support contract with AWS, it's well worth submitting a support ticket from the AWS console when events like this happen. They are usually able to provide an insight into what actually happened.
Note: If you're doing retries, remember to use exponential backoff to reduce the risk of throttling.
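A sketch of that client-side tuning with boto3 (the exact timeout and retry values are assumptions to adapt to your latency budget; they follow the "e.g. 100ms" advice above):

import boto3
from botocore.config import Config

# Short socket timeouts plus a few retries: a request that hits a slow or failing
# storage node times out quickly and is retried, likely landing on a healthy node.
config = Config(
    connect_timeout=0.1,
    read_timeout=0.1,
    retries={"max_attempts": 3, "mode": "standard"},  # standard mode uses exponential backoff
)

dynamodb = boto3.client("dynamodb", config=config)
# "id" is a placeholder partition key attribute name.
item = dynamodb.get_item(TableName="my-table", Key={"id": {"S": "some-id"}})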

Flume channel Capacity is full giving exception

I started the Flume Twitter agent to fetch Twitter data into HDFS. After a few minutes the data could no longer be written to HDFS, and the terminal showed the message below.
Error
org.apache.flume.ChannelException: Take list for MemoryTransaction, capacity 100 full, consider committing more frequently, increasing capacity, or increasing thread count
    at org.apache.flume.channel.MemoryChannel$MemoryTransaction.doTake(MemoryChannel.java:101)
    at org.apache.flume.channel.BasicTransactionSemantics.take(BasicTransactionSemantics.java:113)
    at org.apache.flume.channel.BasicChannelSemantics.take(BasicChannelSemantics.java:95)
    at org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:391)
    at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
    at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
    at java.lang.Thread.run(Thread.java:853)
2015-10-16 15:14:18,126 (SinkRunner-PollingRunner-DefaultSinkProcessor) [ERROR org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:160)] Unable to deliver event. Exception follows.
org.apache.flume.EventDeliveryException: org.apache.flume.ChannelException: Take list for MemoryTransaction, capacity 100 full, consider committing more frequently, increasing capacity, or increasing thread count
I guess we need to increase the channel capacity, but I am not sure; I need some help.
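If it is indeed the channel sizing, I believe these are the memory channel properties to raise in the agent configuration (the agent/channel names below are just from the standard Twitter example and the values are illustrative; the "Take list ... capacity 100" in the error appears to correspond to transactionCapacity):

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 1000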

DynamoDB write latency [duplicate]

In the DynamoDB documentation and in many places around the internet I've seen that single-digit ms response times are typical, but I cannot seem to achieve that even with the simplest setup. I have configured a t2.micro EC2 instance and a DynamoDB table, both in us-west-2, and when running the command below from the AWS CLI on the EC2 instance I get responses averaging about 250 ms. The same command run from my local machine (Denver) averages about 700 ms.
aws dynamodb get-item --table-name my-table --key file://key.json
When looking at the CloudWatch metrics in the AWS console it says the average get latency is 12 ms though. If anyone could tell me what I'm doing wrong or point me in the direction of information where I can solve this on my own I would really appreciate it. Thanks in advance.
The response times you are seeing are largely due to the cold start times of the AWS CLI. When running your get-item command, the CLI has to be loaded into memory, fetch temporary credentials (if using an EC2 IAM role when running on your t2.micro instance), and establish a secure connection to the DynamoDB service. Only after all that is completed does it execute the get-item request and finally print the results to stdout. Your command also needs to read the key.json file off the filesystem, which adds additional overhead.
My experience running on a t2.micro instance is that the AWS CLI has around 200 ms of overhead when it starts, which seems in line with what you are seeing.
This will not be an issue with long running programs, as they only pay a similar overhead price at start time. I run a number of web services on t2.micro instances which work with DynamoDB and the DynamoDB response times are consistently sub 20ms.
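You can see this for yourself by reusing one client and timing successive calls; only the first call pays the connection and credential setup cost. A rough sketch with boto3 (table name and key are placeholders):

import time
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-west-2")

# The first call establishes the TLS connection (and fetches credentials); later calls reuse it.
for i in range(5):
    start = time.perf_counter()
    dynamodb.get_item(TableName="my-table", Key={"id": {"S": "some-id"}})
    print(f"get_item {i}: {(time.perf_counter() - start) * 1000:.1f} ms")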
There are a lot of factors that go into the latency you will see when making a REST API call. DynamoDB can provide latencies in the single digit milliseconds but there are some caveats and things you can do to minimize the latency.
The first thing to consider is distance and speed of light. Expect to get the best latency when accessing DynamoDB when you are using an EC2 instance located in the same region. It is normal to see higher latencies when accessing DynamoDB from your laptop or another data center. Note that each region also has multiple data centers.
There are also performance costs from the client side based on the hardware, network connection, and programming language that you are using. When you are talking millisecond latencies the processing time on your machine can make a difference.
Another likely source of the latency is the TLS handshake. Establishing an encrypted connection requires multiple round trips and computation on both sides to get the encrypted channel established. However, as long as you are using a keep-alive for the connection you will only pay this overhead for the first query. Successive queries will be substantially faster since they do not incur this initial penalty. Unfortunately the AWS CLI isn't going to keep the connection alive between requests, but the AWS SDKs for most languages will manage this for you automatically.
Another important consideration is that the latency that DynamoDB reports in the web console is the average. While DynamoDB does provide reliable average low double digit latency, the maximum latency will regularly be in the hundreds of milliseconds or even higher. This is visible by viewing the maximum latency in CloudWatch.
They recently announced DAX (Preview).
Amazon DynamoDB Accelerator (DAX) is a fully managed, highly available, in-memory cache for DynamoDB that delivers up to a 10x performance improvement – from milliseconds to microseconds – even at millions of requests per second. For more information, see In-Memory Acceleration with DAX (Preview).
