When a request such as dynamoClient.update(params) is throttled, my understanding is that the SDK automatically attempts a number of retries and then the call fails only if all retries fail. If one of them succeeds, then the call succeeds.
My question has to do with how CloudWatch reports throttled requests. If a request originally fails, but one of the retries succeeds, is that reported as a throttled request? Or does it get reported as throttled only if all retries fail?
What I'm really trying to understand when looking at the behavior of my system is how often retries are occurring and whether they're eventually failing or succeeding. When I see a report that 50 requests were throttled, does that mean all 50 failed all retries? Or could some of those 50 have eventually succeeded? If the latter, how can I get a sense of how many eventually succeeded, and how many eventually completely failed?
The CloudWatch metric ThrottledRequests (full details) is incremented each time a DynamoDB read/write operation fails because you've hit your provisioned throughput limit. If the aws-sdk makes three attempts and succeeds on the third, ThrottledRequests is incremented by 2 (once for each failed attempt).
(N.B. there are some nuances when it comes to batch requests, which are outlined in full in the linked docs.)
Measuring "failed requests because the sdk gave up" is a bit harder, because those 'failures' have already been recorded as ThrottledRequests. Once the aws-sdk reaches the maximum number of configured retries, the request fails. At this point you can log the failure, which you might then make available using a custom CloudWatch metric.
If you want to specifically measure "requests that retried but succeeded", you can inspect the property Response.retryCount (docs) and log accordingly.
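As a concrete sketch (assuming the AWS SDK for JavaScript v2, which dynamoClient.update(params) suggests; the function, namespace and metric names below are placeholders of my own), you could log both outcomes like this:

import * as AWS from 'aws-sdk';

const cloudwatch = new AWS.CloudWatch();
const dynamoClient = new AWS.DynamoDB.DocumentClient();

async function updateWithRetryLogging(params: AWS.DynamoDB.DocumentClient.UpdateItemInput) {
  try {
    const data = await dynamoClient.update(params).promise();
    // $response is the underlying AWS.Response; retryCount > 0 means the request
    // was retried (e.g. after throttling) but eventually succeeded.
    if (data.$response.retryCount > 0) {
      console.log(`update succeeded after ${data.$response.retryCount} retries`);
    }
    return data;
  } catch (err: any) {
    if (err.code === 'ProvisionedThroughputExceededException') {
      // All configured retries were exhausted; surface it as a custom metric.
      await cloudwatch.putMetricData({
        Namespace: 'MyApp', // placeholder namespace
        MetricData: [{ MetricName: 'DynamoExhaustedRetries', Value: 1, Unit: 'Count' }],
      }).promise();
    }
    throw err;
  }
}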
I have a webapp and a Windows Service which communicate using Firebase Cloud Messaging. The webapp subscribes to a couple of topics to receive messages, and the Windows Service sends messages to one of these topics. In some cases it can be several messages per second, and it gives me this error:
FirebaseAdmin.Messaging.FirebaseMessagingException: Topic quota exceeded
I don't quite get it. Is there a limit to the number of messages that can be sent to a specific topic, or what does this mean?
So far I have only found info about topic name and subscription limits, but I actually couldn't find anything about a "topic quota", except maybe this page of the docs (https://firebase.google.com/docs/cloud-messaging/concept-options#fanout_throttling), although I am not sure it refers to the same thing, or, if it does, whether and how it can be changed. I can't find anything in the Firebase Console either. Has anybody got an idea?
Well, from this document it seems pretty clear that this can happen:
The frequency of new subscriptions is rate-limited per project. If you send too many subscription requests in a short period of time, FCM servers will respond with a 429 RESOURCE_EXHAUSTED ("quota exceeded") response. Retry with exponential backoff.
I do agree that the document should state what volume triggers the blocking mechanism instead of just telling the developer to "Retry with exponential backoff". But, at the end of the day, Google also produced this document to help developers understand how to properly implement this mechanism. In a nutshell:
If the request fails, wait 1 + random_number_milliseconds seconds and retry the request.
If the request fails, wait 2 + random_number_milliseconds seconds and retry the request.
If the request fails, wait 4 + random_number_milliseconds seconds and retry the request.
And so on, up to a maximum_backoff time.
My conclusion: reduce the number of messages sent to the topic, OR implement a retry mechanism with exponential backoff to recover unsuccessful attempts.
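To make that concrete, here is a minimal sketch of such a retry mechanism using the firebase-admin Node SDK in TypeScript (the question's sender is a .NET Windows Service, but the pattern is the same; the topic name, retry cap and error-code check below are assumptions you'd adapt to what your SDK actually reports):

import * as admin from 'firebase-admin';

admin.initializeApp(); // assumes credentials via GOOGLE_APPLICATION_CREDENTIALS

// Send a message to a topic, retrying with exponential backoff plus jitter
// when FCM reports a quota/rate error, up to maxRetries attempts.
async function sendWithBackoff(message: admin.messaging.Message, maxRetries = 5): Promise<string> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await admin.messaging().send(message);
    } catch (err: any) {
      // 'messaging/message-rate-exceeded' is the code I'd expect for this case;
      // check what your SDK actually returns for "Topic quota exceeded".
      const retriable = err.code === 'messaging/message-rate-exceeded';
      if (!retriable || attempt >= maxRetries) throw err;
      // 1s, 2s, 4s, ... plus jitter, capped at a maximum backoff (64s here).
      const delayMs = Math.min(2 ** attempt * 1000 + Math.random() * 1000, 64000);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

// Usage (topic name is a placeholder):
// await sendWithBackoff({ topic: 'my-topic', data: { payload: '...' } });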
It could be one of these issues:
1. Too high a subscription rate
As noted here:
The frequency of new subscriptions is rate-limited per project. If you send too many subscription requests in a short period of time, FCM servers will respond with a 429 RESOURCE_EXHAUSTED ("quota exceeded") response. Retry with exponential backoff.
But this doesn't seem to be your problem, as you don't open new subscriptions but instead send messages at a high rate.
2. Too many messages sent to one device
As noted here:
Maximum message rate to a single device
For Android, you can send up to 240 messages/minute and 5,000 messages/hour to a single device. This high threshold is meant to allow for short term bursts of traffic, such as when users are interacting rapidly over chat. This limit prevents errors in sending logic from inadvertently draining the battery on a device.
For iOS, we return an error when the rate exceeds APNs limits.
Caution: Do not routinely send messages near this maximum rate. This could waste end users’ resources, and your app may be marked as abusive.
Final notes
Fanout throttling doesn't seem to be the issue here, as that rate limit is really high.
The best ways to fix your issue would be:
Lower your rates: control the number of "devices" notified and, overall, limit your usage over short periods of time.
Keep your rates as they are, but implement a back-off retry policy in your Windows Service App.
Maybe look into a service more suited to your usage (as FCM is strongly focused on end-client notification), such as Pub/Sub.
We have a backend API which runs in almost constant time (it does "sleep" for a given period). When we run a managed API which proxies to it for a long time, we see that from time to time its execution time increases to up to twice the average.
From analyzing the Amazon ALB data in production, it seems that the time the request spends inside Synapse remains the same, but the connection time (the time it takes for the request to enter the queue for processing) is high.
In an isolated environment we noticed that those lags happen approximately every 10 minutes. In production, where we have multiple workers that get requests all the time, the picture is more obscured, as it happens more often (possibly the lags accumulate).
Is anyone aware of any periodic activity in the worker which results in delays entering the queue every few minutes? Is there any parameter that controls this? Otherwise, any idea how to figure out what the cause is?
Attached is an image demonstrating it.
Could be due to gateway token cache invalidation. The default timeout is 15 minutes.
Once key validation is done via the Key Manager, the key validation info is cached in the gateway. For subsequent API invocations, key validation will be done from this cache. During this time, the execution time will be lower.
After the cache is invalidated, token validation will again be done via the Key Manager (using the DB). This causes an increase in the execution time.
Investigating further, we found out two causes for the spikes.
Our handler writes logs to a shared file system, which was set to sync instead of async. This caused delays; changing it to async removed most of the spikes.
Additional spikes seem to be related to registry updates. We did not investigate those, as they were more sporadic.
Sometimes I get throttled events in DynamoDB because of high traffic. Whenever I can see throttled events in the metrics, does it mean that in those cases the data is not being written to the database?
Yes, but are you using an AWS SDK? If so, then it should have retried...
From the docs
Throttling prevents your application from consuming too many capacity units. When a request is throttled, it fails with an HTTP 400 code (Bad Request) and a ProvisionedThroughputExceededException. The AWS SDKs have built-in support for retrying throttled requests (see Error Retries and Exponential Backoff), so you do not need to write this logic yourself.
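As an illustration, here is a hedged sketch with the AWS SDK for JavaScript v2 (the table name and item are placeholders) of where that retry behaviour is configured; data is only actually lost when the call still fails after all of those retries:

import * as AWS from 'aws-sdk';

// The SDK retries throttled requests itself; the call only fails once every
// retry has failed. The retry behaviour is configurable per client:
const dynamoClient = new AWS.DynamoDB.DocumentClient({
  maxRetries: 10,                  // I believe the SDK default for DynamoDB is 10
  retryDelayOptions: { base: 50 }, // base delay (ms) for the exponential backoff
});

async function writeItem(): Promise<void> {
  // This only rejects (with ProvisionedThroughputExceededException) after
  // every configured retry has also been throttled.
  await dynamoClient.put({ TableName: 'my-table', Item: { pk: '1' } }).promise();
}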
I have developed an application using Kafka version 0.9.0.1 that cannot afford to lose any messages.
I have a constraint that the messages must be consumed in the correct sequence.
To ensure I do not lose any messages I have implemented retries within my application code and configured my Producer with acks=all.
To enforce exception handling and to fail fast, I immediately call get() on the Future returned from Producer.send(), e.g.
final Future<RecordMetadata> futureRecordMetadata = KAFKA_PRODUCER.send(producerRecord);
futureRecordMetadata.get();
This approach works fine for guaranteeing the delivery of all messages, however the performance is completely unacceptable.
For example, it takes 34 minutes to send 152,125 messages with acks=all.
When I comment out the futureRecordMetadata.get(), I can send 1,089,125 messages in 7 minutes.
When I change acks=all to acks=1, I can send 815,038 messages in 30 minutes. Why is there such a big difference between acks=all and acks=1?
However by not blocking on the get() I have no way of knowing if the message arrived safely.
I know I can pass a Callback into send() and have Kafka retry for me; however, this approach has the drawback that messages may be consumed out of sequence.
I thought the request.required.acks config could save the day for me; however, when I set any value for it I receive this warning:
130 [NamedConnector-Monitor] WARN org.apache.kafka.clients.producer.ProducerConfig - The configuration request.required.acks = -1 was supplied but isn't a known config.
Is it possible to asynchronously send Kafka messages, with a guarantee they will ALWAYS arrive safely and in the correct sequence?
UPDATE 001
Is there any way I can consume messages in Kafka message KEY order directly from the TOPIC? Or would I have to consume messages in offset order and then sort them programmatically into Kafka message key order?
If you expect a total order, the send performance will be bad (in practice, a total-order requirement is very rare).
If per-partition order is acceptable, you can use multiple producer threads: one producer/thread for each partition.
I'm currently investigating Rebus, but being unable to find good documentation, this process is proving difficult. I am hoping someone can help me understand this exciting product.
I have read that during message processing, if something goes wrong the message will return to the queue.
Is the message returned to the front of the queue or placed at the end? If placed at the front, this will be a problem because the queue in essence becomes blocked by a message that may not be able to be processed - at least until it times out or its retries are exceeded.
Does Rebus have support for an out-of-the-box separate Retry queue?
Can I specify the interval between retries?
Can I specify an exponential backoff interval for retries as in Apache ActiveMQ?
Thanks
1) The queue transaction is rolled back, effectively moving the message back in front - therefore, it will be immediately retried.
After 5 failed attempts (at least that is the default), Rebus will move the message to the error queue. The default retry mechanism is intentionally very swift - this way, the input queue will never be clogged by poisonous messages.
If you need more sophisticated retries, I suggest you take a look at bus.Defer - it can defer delivery of a message to the future. It requires that you have a timeout manager(*) running, though.
2) I guess that's what I call "error queue", except there's no retry :)
I did create a solution some time ago, though, where I coded a simple endpoint that would periodically empty the error queue and move all the messages back into the original source queue, as a form of crude automatic second-level retry mechanism.
3) No. NServiceBus has the concept of second-level retries, but this is something that I've never really needed (enough) with Rebus. But with Rebus, you're on your own here - it should be fairly easy to do some intelligent bus.Defer that can then be easily adapted to each kind of error that you're expecting.
4) See (3)
I hope that clarifies a bit :)
(*) The timeout manager can be a separate endpoint whose only job in life is to receive a message, hold on to it for a while (i.e. save it to a database), and then return it to the sender when the time has elapsed. The timeout manager can also be hosted in-process, though, by using the .Timeouts(t => t.???) configuration spell.