How can I avoid latency in Event Hub consumer data?
My architecture (data flow): IoT Hub -> Event Hub -> Blob Storage (no deviation from the IoT Hub packet to the Blob Storage JSON packet).
Deviation occurs only on the consumer application side (the listener receives with a delay of 30-50 seconds).
Azure configuration: 4 partitions on a Standard (S2) tier subscription.
Publisher: 3,000 packets per minute.
My question: Blob Storage has the proper data without deviation, so why is the listener receiving it with latency? How can I overcome this?
I tried EventProcessorClient with the respective handlers, as suggested in the GitHub sample code. It works fine without errors, but with huge latency. I tried EventHubProducerClient as well; still the same latency issue.
I can't speak to how IoT Hub manages data internally or what its expected latency is between IoT data being received and when IoT Hub itself publishes to Event Hubs.
With respect to Event Hubs, you should expect to see individual events with varying degrees of end-to-end latency. Event Hubs is optimized for throughput (the number of events that flow through the system) and not for the latency of an individual event (the amount of time it takes for it to flow from publisher to consumer).
What I'd suggest monitoring is the backlog of events available to be read in a partition. If there are ample events already available in the partition and you’re not seeing them flow consistently through the processor as fast as you’re able to process them, that’s something we should look to tune.
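If it helps, here is a minimal sketch of that backlog check using the Java azure-messaging-eventhubs client (the connection string and hub name are placeholders; the .NET client exposes the same partition properties):

```java
import com.azure.messaging.eventhubs.EventHubClientBuilder;
import com.azure.messaging.eventhubs.EventHubConsumerClient;

public class BacklogCheck {
    public static void main(String[] args) {
        EventHubConsumerClient consumer = new EventHubClientBuilder()
            .connectionString("<EVENT_HUBS_CONNECTION_STRING>", "<EVENT_HUB_NAME>") // placeholders
            .consumerGroup(EventHubClientBuilder.DEFAULT_CONSUMER_GROUP_NAME)
            .buildConsumerClient();

        for (String partitionId : consumer.getPartitionIds()) {
            var props = consumer.getPartitionProperties(partitionId);
            // Compare the last enqueued sequence number against the sequence
            // number your processor last checkpointed: a growing gap means a
            // backlog; a small, stable gap means the events simply aren't
            // there to read yet.
            System.out.printf("Partition %s: last enqueued seq=%d at %s%n",
                partitionId,
                props.getLastEnqueuedSequenceNumber(),
                props.getLastEnqueuedTime());
        }
        consumer.close();
    }
}
```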
Additional Event Hubs context
When an event is published - by IoT Hub or another producer - the operation completes when the service acknowledges receipt of the event. At this point, the service has not yet committed the event to a partition, and it is not available to be read. The time that it takes for an event to become available for reading varies and has no SLA associated with it. Most often it's milliseconds, but it can be several seconds in some scenarios - for example, if a partition is moving between nodes.
Another thing to keep in mind is that networks are inherently unreliable. The Event Hubs consumer types, including EventProcessorClient, are resilient to intermittent failures and will retry or recover, which will sometimes entail creating a new connection, opening a new link, performing authorization, and positioning the reader. This is also the case when scaling up/down and partition ownership is moving around. That process may take a bit of time and varies depending on the environment.
Finally, it's important to note that overall throughput is also limited by the time that it takes for you to process events. For a given partition, your handler is invoked and the processor will wait for it to complete before it sends any more events for that partition. If it takes 30 seconds for your application to process an event, that partition will only see 2 events per minute flow through.
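To make that concrete, here is a hedged sketch of a processor whose handler stays cheap, using the Java SDK as a stand-in for whichever client you're on; the connection strings and `handleQuickly` are placeholders. Note that checkpointing every event, as shown, costs a storage round trip per event; in practice you would checkpoint every N events or on a timer.

```java
import com.azure.messaging.eventhubs.EventProcessorClient;
import com.azure.messaging.eventhubs.EventProcessorClientBuilder;
import com.azure.messaging.eventhubs.checkpointstore.blob.BlobCheckpointStore;
import com.azure.storage.blob.BlobContainerAsyncClient;
import com.azure.storage.blob.BlobContainerClientBuilder;

public class ProcessorSketch {
    public static void main(String[] args) throws InterruptedException {
        BlobContainerAsyncClient container = new BlobContainerClientBuilder()
            .connectionString("<STORAGE_CONNECTION_STRING>")   // placeholder
            .containerName("checkpoints")
            .buildAsyncClient();

        EventProcessorClient processor = new EventProcessorClientBuilder()
            .connectionString("<EVENT_HUBS_CONNECTION_STRING>", "<EVENT_HUB_NAME>") // placeholders
            .consumerGroup("$Default")
            .checkpointStore(new BlobCheckpointStore(container))
            .processEvent(context -> {
                // Keep this fast: the processor will not deliver the next event
                // for this partition until the handler returns.
                handleQuickly(context.getEventData().getBodyAsString());
                context.updateCheckpoint();
            })
            .processError(context -> System.err.printf("Error on partition %s: %s%n",
                context.getPartitionContext().getPartitionId(), context.getThrowable()))
            .buildEventProcessorClient();

        processor.start();
        Thread.sleep(60_000);  // run for a minute in this sketch
        processor.stop();
    }

    private static void handleQuickly(String body) {
        // Hand off to your own pipeline; avoid long blocking work here.
    }
}
```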
Related
We have a bus reservation system running in GKE in which we handle the creation of reservations with different threads. Because of that, CRUD Java methods can sometimes run simultaneously on the same bus, resulting in only the LAST simultaneous update being saved in our DB (the other simultaneous updates are lost).
Even if the probability is low (the simultaneous updates need to be really close, 1-2 seconds apart), we need to avoid this. My question is about how to address the solution:
Lock the bus object and return an error to the other simultaneous requests (see the optimistic-locking sketch after this question)
An in-memory map or a Redis cache to track the bus requests
Use GCP Pub/Sub, Kafka or RabbitMQ as a queue system.
Try to focus the efforts on reducing the simultaneous time window (from 1-2 seconds down to milliseconds)
Others?
Also, we are worried that the GKE request-handling scalability may become an issue in the future. If we manage a relatively higher number of buses, will we need to implement a queue system between the client and the server? Or will the GKE load balancer and Ambassador already manage it for us? If we do need a queue system in the future, could it also be used for the collision problem we are facing now?
Last, the reservation requests from the client often take a while. Therefore, we are changing the requests to be handled asynchronously, with long polling from the client to learn the task status. Could we link this solution to the current problem? For example, using the Redis cache or the queue system to track the task status? Or should we try to keep the requests synchronous and focus on reducing the processing time (which may be quite difficult)?
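Not a full answer, but note that your first option does not require an actual lock: since these are Java CRUD methods, optimistic locking with a JPA `@Version` field makes the losing simultaneous write fail loudly instead of silently overwriting. A minimal sketch (the entity and field names are illustrative):

```java
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Version;

@Entity
public class Bus {
    @Id
    private Long id;

    private int seatsAvailable;

    // JPA increments this column on every successful UPDATE. If two requests
    // load the same row and both try to save, the second UPDATE sees a stale
    // version and throws OptimisticLockException instead of silently winning.
    @Version
    private long version;

    // getters and setters omitted
}
```

The service method would catch `OptimisticLockException` (or Spring's `ObjectOptimisticLockingFailureException`, if you are on Spring Data) and either retry the read-modify-write or return an error to the caller, which is exactly your first option without blocking anyone.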
We are using the Cosmos DB SDK, version 2.9.2, to perform document CRUD operations. Usually the end-to-end P95 latency is 20 ms, but sometimes the latency is over 1000 ms, and the high-latency period lasts from 10 hours to a day. The collection is not throttling.
We got some background information from:
https://icm.ad.msft.net/imp/v3/incidents/details/171243015/home
https://icm.ad.msft.net/imp/v3/incidents/details/168242283/home
There are some diagnostics strings in the tickets.
We know that the client maintains a cache of the mapping from logical partition to physical replica address. This mapping may be outdated because of replica movement or an outage, so the client tries to read from the second/third replica. However, this retry has a significant impact on end-to-end latency. We also observe that the high latency/timeouts can last for several hours, even days. I expect there is some mechanism for refreshing the mapping cache in the client, but it seems the client only stops visiting more than one replica after we redeploy our service.
Here are my questions:
How can the client tell whether it's unable to connect to a certain replica? Will the client wait until a timeout, or does the server tell the client that the replica is unavailable?
Under which conditions is the mapping cache refreshed? We are using Session consistency and TCP mode.
Will restarting our service force the cache to be refreshed? Or does refreshing only happen when the machine restarts?
When we find there's a replica outage, is there any way to mitigate it quickly?
What operations are performed (document CRUD or query)?
And what are the observed latencies and frequencies? Also, please check whether the collection is throttling (with a custom throttling policy).
The client does manage some metadata and handles its staleness efficiently within SLA bounds.
Can you please create a support ticket with the account details and the 'RequestDiagnostics' string, and we shall look into it.
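For reference, here is a hedged sketch of capturing that diagnostics string around slow reads, assuming the async Java SDK v2 (`com.microsoft.azure:azure-cosmosdb`, which matches version 2.9.2). The endpoint, key, and document link are placeholders, and you should confirm that `getRequestDiagnosticsString()` exists in your exact SDK version:

```java
import com.microsoft.azure.cosmosdb.ConnectionMode;
import com.microsoft.azure.cosmosdb.ConnectionPolicy;
import com.microsoft.azure.cosmosdb.ConsistencyLevel;
import com.microsoft.azure.cosmosdb.Document;
import com.microsoft.azure.cosmosdb.ResourceResponse;
import com.microsoft.azure.cosmosdb.rx.AsyncDocumentClient;

public class SlowReadLogger {
    public static void main(String[] args) {
        ConnectionPolicy policy = ConnectionPolicy.GetDefault();
        policy.setConnectionMode(ConnectionMode.Direct); // TCP/direct mode, as in the question

        AsyncDocumentClient client = new AsyncDocumentClient.Builder()
            .withServiceEndpoint("<ACCOUNT_ENDPOINT>")      // placeholder
            .withMasterKeyOrResourceToken("<ACCOUNT_KEY>")  // placeholder
            .withConnectionPolicy(policy)
            .withConsistencyLevel(ConsistencyLevel.Session)
            .build();

        // Placeholder link; for a partitioned collection you would also set
        // the partition key on a RequestOptions instead of passing null.
        String docLink = "dbs/mydb/colls/mycoll/docs/mydoc";
        long start = System.nanoTime();

        ResourceResponse<Document> response =
            client.readDocument(docLink, null).toBlocking().single();

        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        if (elapsedMs > 100) {
            // This diagnostics string is what the support ticket asks for.
            System.err.printf("Slow read (%d ms): %s%n",
                elapsedMs, response.getRequestDiagnosticsString());
        }
        client.close();
    }
}
```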
The volume of AnyOfferChanged notifications varies over time, and we don't have any specific way to read multiple notifications together.
So I am reading them one by one and saving some of the information in a SQL Server database. It takes so long that I can never finish reading all the notifications.
What is the best possible way to achieve this?
Here's what I did: I started by clearing out the queue. Then I started my Windows service, which polled the queue every few seconds. I believe I pulled back 10 messages at a time. I would get a total count of messages and then spin up a number of threads suited to the number of messages waiting. One by one, I read each message, added it to my SQL database, and then deleted the message from SQS.
Over time, I understood better how many threads to spin up and how often to poll my queue. As long as my service was running, I would maintain just a handful of SQS messages in the queue at a time and would quickly read and process them. Occasionally, due to bad programming (yeah, it happens), my service would crash and I wouldn't know about it. Tens of thousands of messages would queue up, and I would put my service into "crisis" mode, which polled at an increasing rate and essentially maxed out the number of calls I could make to SQS. Usually my service would catch up in a few hours, and then I would increase the polling interval again. Sometimes, though, I would just dump the queue and start over, as I'd have potentially hundreds of price changes on a single SKU and didn't want to waste the processing time going through them. But most of the time, things ran smoothly.
Why can't you read more than one notification together? Like I said, I believe I read 10 at a time on each thread. Once I got the 10 messages, I processed them in a loop and dumped them into a SQL database. Once the 10 were processed, I sent a request up to SQS to delete them (see the sketch below).
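For anyone who wants the mechanics, here is a rough sketch of that receive-10/process/batch-delete loop using the AWS SDK for Java (v1); the queue URL and `saveToSqlServer` are placeholders for your own setup:

```java
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.DeleteMessageBatchRequestEntry;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;

import java.util.ArrayList;
import java.util.List;

public class NotificationDrainer {
    public static void main(String[] args) {
        AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
        String queueUrl = "<YOUR_QUEUE_URL>"; // placeholder

        while (true) {
            // Ask for up to 10 messages (the SQS maximum per call) and use
            // long polling so empty receives don't burn API calls.
            ReceiveMessageRequest request = new ReceiveMessageRequest(queueUrl)
                .withMaxNumberOfMessages(10)
                .withWaitTimeSeconds(20);
            List<Message> messages = sqs.receiveMessage(request).getMessages();
            if (messages.isEmpty()) {
                continue;
            }

            List<DeleteMessageBatchRequestEntry> deletes = new ArrayList<>();
            for (Message m : messages) {
                saveToSqlServer(m.getBody()); // your own persistence logic
                deletes.add(new DeleteMessageBatchRequestEntry(
                    m.getMessageId(), m.getReceiptHandle()));
            }
            // Delete the whole batch in one call once everything is persisted.
            sqs.deleteMessageBatch(queueUrl, deletes);
        }
    }

    private static void saveToSqlServer(String notificationXml) {
        // Parse the notification and insert the fields you need into SQL Server.
    }
}
```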
I ran this for several years on an account with over 10,000 SKUs. We had up-to-the-minute price-change notifications on all our products and could instantly reprice and update Amazon if needed.
I'm running BizTalk 2006, and I have an orchestration that receives a series of messages (orders) correlated on BTS.MessageType. On my delay shape, I check the time until midnight, which is the batch cut-off. I occasionally receive a message after the loop ends, which creates zombie messages. I still need these messages to be processed, but in a new instance of the orchestration. I need some ideas on how to handle this gracefully.
One option would be to correlate on the date (in addition to BTS.MessageType).
You would have to create a pipeline component that promotes the date without the time. But there could be a time window during which messages go "randomly" to either the old or the new instance (for example, if you have multiple BizTalk servers with slightly different times, or if a system clock is resynchronized with an NTP source). To be safe, wait a few minutes before ending the previous day's instance.
If that window of overlap between the old and new instances is a problem, you should instead correlate on another value that changes only once a day, such as a Guid stored in a database and promoted by a pipeline component.
Otherwise, I've successfully used your "hackish" solution in past projects, as long as you can tolerate a small window every day where messages are queued and not processed immediately for a few minutes. In my case it was fine because the messages were produced by American users during their work day and sent by FTP or MSMQ. However, if you have international users who send messages via web services, you may not have a time of day when you can safely assume nothing will arrive, and the web services won't be able to queue the messages for later processing.
I need to log every call to my Web API to the database.
Now, of course, I don't want to go to the database on every call.
So let's say I have a dictionary or a hash table object in my cache, and every 10,000 records I go to the database. I still don't want every 10,000th user to wait for this operation. And I can't start a different thread for long operations, since the application pool can be recycled at basically any time.
What is the best solution for this scenario?
Thanks
I would argue that your view of durability is rather inconsistent: your cache of 10,000 objects could also be lost at any time due to an app pool recycle or a server crash.
But to the original question of how to perform a large operation without causing the user to wait:
Put constraints on app pool recycling and deal with the potential data loss.
Periodically dump the cached messages to a Windows service for further processing. This is still not 100% guaranteed to preserve data, e.g. the service/server could crash.
Use a message queue (MSMQ), possibly with WCF. A message queue can persist to disk, so this can be considered reasonably reliable.
Message Queuing (MSMQ) technology enables applications running at different times to communicate across heterogeneous networks and systems that may be temporarily offline. Applications send messages to queues and read messages from queues.
Message Queuing provides guaranteed message delivery, efficient routing, security, and priority-based messaging. It can be used to implement solutions to both asynchronous and synchronous scenarios requiring high performance.
Taking this a step further...
Depending on your requirements and/or environment, you could probably eliminate your cache and write all messages immediately (and rapidly) to a message queue, without worrying about performance loss or a large write operation.
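The examples above are .NET/MSMQ-flavored, but the shape is the same in any stack: the request path does an O(1) enqueue, and a background consumer drains the queue in batches. Here is a small Java sketch of that shape, with an in-process queue standing in for the durable broker (so it still shares the cache's data-loss window noted above):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class BatchingLogWriter {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    public BatchingLogWriter() {
        Thread drainer = new Thread(this::drainLoop, "log-drainer");
        drainer.setDaemon(true);
        drainer.start();
    }

    /** Called on the request path: O(1), never touches the database. */
    public void log(String entry) {
        queue.offer(entry);
    }

    private void drainLoop() {
        List<String> batch = new ArrayList<>();
        try {
            while (true) {
                // Block for the first entry, then gather whatever else is waiting.
                batch.add(queue.take());
                queue.drainTo(batch, 999);
                writeBatchToDatabase(batch); // one round trip per batch
                batch.clear();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void writeBatchToDatabase(List<String> batch) {
        // Single multi-row INSERT here. Like the 10,000-entry cache in the
        // question, anything still in this in-process queue is lost on a crash
        // or recycle; a broker that persists to disk closes that gap.
    }
}
```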