Amazon DynamoDB Async client HTTP metrics

I am using Amazon SDK (Java) DynamoDB Async client v2.10.14 with custom configuration:
DynamoDbAsyncClientBuilder = DynamoDbAsyncClient
    .builder()
    .region(region)
    .credentialsProvider(credentialsProvider)
    .httpClientBuilder(
        NettyNioAsyncHttpClient.builder()
            .readTimeout(props.readTimeout)
            .writeTimeout(props.writeTimeout)
            .connectionTimeout(props.connectionTimeout)
    )
I often run into a connect timeout:
io.netty.channel.ConnectTimeoutException: connection timed out: dynamodb.region.amazonaws.com/1.2.3.4:443
I expect this is due to my settings, but I need aggressive timeouts. I used to run into the same issues with the defaults anyway (it just took longer). I would like to know why I am getting into this situation. My gut feel is that it's related to connection pool exhaustion, or other issues with the pool.
Are there any metrics I can turn on to monitor this?

It seems your application is what AWS calls a "latency-aware DynamoDB client application". The underlying HTTP client behavior needs to be tuned together with its retry strategy. Luckily, the AWS Java SDK gives you full control over both the HTTP client behavior and the retry strategy. This blog post explains how to tune the AWS Java SDK HTTP request settings and parameters for the underlying HTTP client:
https://aws.amazon.com/blogs/database/tuning-aws-java-sdk-http-request-settings-for-latency-aware-amazon-dynamodb-applications/
The post gives an example of tuning the five HTTP client configuration parameters that are set when the ClientConfiguration object is created, and discusses each of them in detail:
ConnectionTimeout
ClientExecutionTimeout
RequestTimeout
SocketTimeout
The DynamoDB default retry policy for HTTP API calls with a custom maximum error retry count
Tuning this latency-aware DynamoDB application "requires an understanding of the average latency requirements and record characteristics (such as the number of items and their average size) for your application, interdependencies between different application modules or microservices, and the deployment platform. Careful application API design, proper timeout values, and a retry strategy can prepare your application for unavoidable network and server-side issues"
Quoted from: https://aws.amazon.com/blogs/database/tuning-aws-java-sdk-http-request-settings-for-latency-aware-amazon-dynamodb-applications/
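To see how the timeout values and the retry count interact, it can help to estimate the worst-case time a single logical call may take when every attempt times out. The sketch below is plain Java with no SDK dependency. The class name, the parameter names, and the capped-doubling backoff are assumptions made for illustration; it is a simplified model, not the SDK's actual retry algorithm.

```java
import java.time.Duration;

/** Rough worst-case latency estimate for a call that is retried after timeouts. */
final class LatencyBudget {

    /**
     * Upper bound on wall-clock time when every attempt times out:
     * (maxRetries + 1) attempts, each costing connect + request timeout,
     * plus a pause before each retry that doubles up to maxBackoff.
     * Assumption: this backoff shape is illustrative, not taken from the SDK.
     */
    static Duration worstCase(Duration connectTimeout, Duration requestTimeout,
                              int maxRetries, Duration baseBackoff, Duration maxBackoff) {
        Duration perAttempt = connectTimeout.plus(requestTimeout);
        Duration total = perAttempt.multipliedBy(maxRetries + 1L);
        Duration backoff = min(baseBackoff, maxBackoff);
        for (int retry = 0; retry < maxRetries; retry++) {
            total = total.plus(backoff);
            backoff = min(backoff.multipliedBy(2), maxBackoff);
        }
        return total;
    }

    private static Duration min(Duration a, Duration b) {
        return a.compareTo(b) <= 0 ? a : b;
    }

    public static void main(String[] args) {
        // 4 attempts x (100 + 400) ms, plus pauses of 50, 100 and 200 ms.
        Duration budget = worstCase(Duration.ofMillis(100), Duration.ofMillis(400),
                3, Duration.ofMillis(50), Duration.ofMillis(200));
        System.out.println("Worst case: " + budget.toMillis() + " ms"); // prints: Worst case: 2350 ms
    }
}
```

With aggressive per-attempt timeouts, the retry count dominates this bound, so it is worth checking the result against the latency your callers can actually tolerate.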

Related

I have multiple JMS listeners that use the @JmsListener annotation with a queue as the destination, and I am observing loss of events in our listeners

The volume of our application is very high, around 500 million events per day, and we are observing some loss of events. Please share if anyone has observed loss of events while consuming with @JmsListener.
There could be many reasons:
some of your messages produce errors, and the resulting rollbacks are not handled carefully
you are not using transactions while experiencing connectivity, sync, or infrastructure issues
it can be your configuration: not using persistent messaging, caching messages on the client side within the driver, etc.
Whatever it is, it is NOT an issue with the @JmsListener annotation, but with your code, driver, or messaging configuration.

Microservice Design with SignalR

We have built a microservice architecture but have run into an issue where the messages going on to the bus are too large. (We discovered this after moving to Azure Service Bus, which only allows 256KB compared to RabbitMQ's 4MB.)
Our design is shown in the diagram below. Where we're struggling is with the data being returned.
An example is when performing a search and returning multiple results.
To step through our current process:
Web client sends an HTTP request to the Web API.
Web API then puts the appropriate message on the bus. (Web API responds to the client with an Accepted response.)
Microservice picks up this message.
Microservice queries its database for the records matching search criteria.
Results returned from database.
A SearchResult message is added to the bus. (This contains the results)
Our response microservice is listening for this SearchResult message.
The response microservice then posts to our SignalR API.
SignalR API sends the results back to the web client.
My question is how do we deal with large result sets when designed in this way? If it's not possible, how should the design be changed to handle large result sets?
I understand we could page the results but even so one result could be over the 256KB allowance, for example a document or a particularly large object.
There are two ways:
Use a Kafka-like system that supports large messages.
If you can't go with the first approach (which appears to be the case from your question), then the microservice can place one of two types of message on the bus for the response service:
(1.) If the result is small, place the complete result in the message, and
(2.) If the result is larger than the supported size, place a message that contains a link to an Azure Storage blob holding the result.
Based on the message type, the response service can fetch the proper result and return it to the client.
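The second approach can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: BlobStore, InMemoryBlobStore, BusMessage, and the 8 KB headroom are hypothetical names and values invented for the example, and a real system would call the Azure Storage and Service Bus client libraries instead of the in-memory stand-in.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** Sketch: send small results inline, store large ones and send a link instead. */
final class SearchResultPublisher {

    /** Assumed 256 KB broker limit, minus assumed headroom for headers and envelope. */
    static final int MAX_INLINE_BYTES = 256 * 1024 - 8 * 1024;

    /** Hypothetical stand-in for a blob store such as Azure Blob Storage. */
    interface BlobStore {
        String put(String key, byte[] content); // returns a reference to the stored blob
    }

    /** In-memory store, used only to make the sketch runnable. */
    static final class InMemoryBlobStore implements BlobStore {
        final Map<String, byte[]> blobs = new HashMap<>();

        @Override
        public String put(String key, byte[] content) {
            blobs.put(key, content);
            return "memory://" + key;
        }
    }

    /** Message placed on the bus: exactly one of the two fields is non-null. */
    static final class BusMessage {
        final String inlinePayload;
        final String blobReference;

        BusMessage(String inlinePayload, String blobReference) {
            this.inlinePayload = inlinePayload;
            this.blobReference = blobReference;
        }

        boolean isInline() {
            return inlinePayload != null;
        }
    }

    /** Chooses the message type based on the encoded size of the result. */
    static BusMessage toMessage(String resultJson, BlobStore store) {
        byte[] bytes = resultJson.getBytes(StandardCharsets.UTF_8);
        if (bytes.length <= MAX_INLINE_BYTES) {
            return new BusMessage(resultJson, null);
        }
        String reference = store.put("search-results/" + UUID.randomUUID(), bytes);
        return new BusMessage(null, reference);
    }

    public static void main(String[] args) {
        InMemoryBlobStore store = new InMemoryBlobStore();
        String large = new String(new char[300 * 1024]).replace('\0', 'x');
        System.out.println(toMessage("{\"hits\":[]}", store).isInline()); // prints: true
        System.out.println(toMessage(large, store).isInline());           // prints: false
    }
}
```

The response service would check isInline() and either forward the payload directly or fetch the blob first. This also covers the case of a single result that exceeds the limit on its own, which paging cannot solve.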

Do throttled write events mean my data was not written to the database?

Sometimes I get throttled events in DynamoDB because of high traffic. Whenever I see throttled events in the metrics, does it mean that in those cases the data is not being written to the database?
Yes, but are you using an AWS SDK? If so, it should have retried...
From the docs:
Throttling prevents your application from consuming too many capacity
units. When a request is throttled, it fails with an HTTP 400 code
(Bad Request) and a ProvisionedThroughputExceededException. The AWS
SDKs have built-in support for retrying throttled requests (see Error
Retries and Exponential Backoff), so you do not need to write this
logic yourself.
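For readers who are not going through an SDK, or who want to see what such a retry loop looks like, here is a minimal sketch of retrying with exponential backoff and random jitter. ThrottledException is a hypothetical stand-in for the real throttling error, and the base delay, cap, and retry count are assumed values. The SDKs' built-in retry logic may differ in its details.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

/** Sketch of a retry loop with capped exponential backoff and full jitter. */
final class BackoffRetry {

    /** Hypothetical marker for a throttled request; a real SDK has its own exception type. */
    static final class ThrottledException extends RuntimeException {
        ThrottledException(String message) {
            super(message);
        }
    }

    /** Largest pause allowed before the next retry: base * 2^attempt, capped at maxMillis. */
    static long backoffCeilingMillis(int attempt, long baseMillis, long maxMillis) {
        int exponent = Math.min(attempt, 20); // keep the shift small to avoid overflow
        return Math.min(maxMillis, baseMillis << exponent);
    }

    /** Runs the call, retrying up to maxRetries times when it is throttled. */
    static <T> T callWithRetry(Supplier<T> call, int maxRetries, long baseMillis, long maxMillis) {
        for (int attempt = 0; ; attempt++) {
            try {
                return call.get();
            } catch (ThrottledException e) {
                if (attempt >= maxRetries) {
                    throw e; // retries exhausted: the write did not happen
                }
                long ceiling = backoffCeilingMillis(attempt, baseMillis, maxMillis);
                // Full jitter: sleep a random time between 0 and the ceiling.
                sleep(ThreadLocalRandom.current().nextLong(ceiling + 1));
            }
        }
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while backing off", ie);
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        String result = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new ThrottledException("simulated throttle");
            }
            return "written";
        }, 5, 50, 1000);
        System.out.println(result + " after " + calls[0] + " attempts"); // prints: written after 3 attempts
    }
}
```

The point for the question above is the final throw: a throttled write is lost only if every retry is also throttled and the caller ignores the resulting exception.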

ASP.Net API App - continual HTTP 502.3 errors

My team and I have been at this for 4 full days now, analyzing every log available to us, Azure Application Insights, you name it. We still cannot get to the cause of this issue.
We have a customer who is integrated with our API to make search calls and they are complaining of intermittent but continual 502.3 Bad Gateway errors.
Here is the flow of our architecture:
All resources are in Azure. The endpoint our customers call is a .NET Framework 4.7 Web App Service in Azure that acts as the stateless handler for all the API calls and responses.
This API app sends the calls to an Azure Service Fabric Cluster. That cluster load balances on the way in and distributes the API calls to our Search Service Application. The Search Service Application then generates an ElasticSearch query from the API call, and sends that query to our ElasticSearch cluster.
ElasticSearch then sends the results back to Service Fabric, and the process reverses from there until the results are sent back to the customer from the API endpoint.
What may separate our process from a typical API is that our response payload can be relatively large, depending on the search. Over the last several days, the payload of a single response has ranged from 6MB to 12MB. Our searches simply return a lot of data from ElasticSearch. In any case, a normal search is typically executed and returned in 15 seconds or less. We have already increased our timeout window to 5 minutes to try to handle what is happening and to reduce timeout errors, given that their searches are taking so long. We increased the timeout via the following code in Startup.cs:
services.AddSingleton<HttpClient>(s => {
    return new HttpClient() { Timeout = TimeSpan.FromSeconds(300) };
});
I've read in some places that you actually have to do this in the web.config file as opposed to here, or at least in addition to it. Not sure if this is true?
The customer who is getting the 502.3 errors has significantly increased the volume they are sending us over the last week, but we believe we are fully scaled to handle it. They are still trying to put the issue on us, but after many days of research, I'm starting to wonder if the problem is actually on their side. Could it be that they are not equipped to take the increased payload on their side? Could their integration architecture be insufficiently scaled to take the return payload from the increased volumes? When we observe resource usage (CPU/RAM/IO) on all of the above applications, it is all normal, below 50%. This also makes me wonder if this is on their side.
I know it's a bit of a subjective question, but I'm hoping for some insight from someone who may have experienced this before, and even more importantly, from someone who has experience with a .NET API app in Azure that returns large datasets in its responses.
Any code blocks of our API app, or screenshots from Application Insights are available to post upon request - just not sure what exactly anyone would want to see yet as I type this.

IIS Request Logging

We are looking to add some performance measuring into our LOB web application. Is there a way to log all requests into IIS including the details of the request, the upload speed and time, the latency and the download speed and time?
We will store this into a log file so the customer can post this to us for analysis (the customer internally hosts our LOB web application).
Thanks
IIS 7 natively provides logging features. They give you basic information about requests (status code, date, call duration, IP, referer, ...). That is already a good starting point, and it is very easy to enable in IIS Manager.
Advanced Logging, distributed here or via WPI, gives you a way to log additional information (HTTP headers, HTTP responses, custom fields, ...). A really good introduction is available here.
That's the best you can do without getting into ASP.NET.
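Once the customer sends the log files back, the call duration can be pulled out of the W3C-format logs with a small parser. The sketch below assumes the time-taken field is enabled in the IIS logging configuration and uses the #Fields: header line to find its column. The class name and the sample log lines are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch: extract the time-taken column (milliseconds) from W3C extended log lines. */
final class W3cLogTimes {

    static List<Long> timeTakenMillis(List<String> lines) {
        List<Long> result = new ArrayList<>();
        int column = -1; // position of time-taken, learned from the #Fields: directive
        for (String line : lines) {
            if (line.startsWith("#Fields:")) {
                String[] fields = line.substring("#Fields:".length()).trim().split("\\s+");
                column = Arrays.asList(fields).indexOf("time-taken");
            } else if (!line.startsWith("#") && !line.trim().isEmpty() && column >= 0) {
                String[] values = line.trim().split("\\s+");
                // Skip short lines and "-" placeholders rather than failing.
                if (column < values.length && !values[column].equals("-")) {
                    result.add(Long.parseLong(values[column]));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample lines in the W3C extended format.
        List<String> sample = Arrays.asList(
                "#Fields: date time cs-method cs-uri-stem sc-status time-taken",
                "2024-01-01 10:00:00 GET /api/search 200 153",
                "2024-01-01 10:00:01 POST /api/upload 200 2048");
        System.out.println(timeTakenMillis(sample)); // prints: [153, 2048]
    }
}
```

This gives total duration per request only. It does not split the time into upload, processing, and download phases.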
There is no direct out-of-the-box solution for your problem. As Cybermaxs suggests, you can use W3C logs to get information about requests, but those logs do not break down the request/response times in the way you seek.
You have two options:
1) Write an IIS module (C++ implementing CHttpModule in HTTPSERV.H) which intercepts all the relevant events and logs the times as you require. The problem with this solution is that writing these modules can be tricky and is error-prone.
2) Leverage IIS's Failed Request Tracing (http://www.iis.net/learn/troubleshoot/using-failed-request-tracing/troubleshoot-with-failed-request-tracing), which causes IIS to write detailed logs that include a breakdown of time spent per request in a verbose, parseable XML format. You can enable "Failed Request Tracing" even for successful requests. The problem is that an individual XML file is generated for each request, so you'll have to manage the directory (and the Failed Request Tracing configuration) so that this behaviour doesn't cause too much pain for your customer.
