Ensure In-Process Records are Unique in ActiveMQ (asynchronous)

I'm working on a system where clients enter data into a program and the save action posts a message to ActiveMQ for more time-intensive processing.
We are running into rare occasions where a record is updated by a client twice in a row and a consumer on that ActiveMQ queue processes the two updates at the same time. I'm looking for a way to ensure that messages containing records with the same identity are processed in order and only one at a time. To be clear, if messages for records with IDs 1, 1, and 2 (in that order) are sent to ActiveMQ, the first 1 would process, then 2 (if the first 1 was still in process), and finally the second 1.
Another requirement (due to volume) is that the consumer be multi-threaded, so there may be 16 threads accessing that queue. This would have to be taken into consideration.

If you have multiple threads reading that queue and you want the solution to stay close to ActiveMQ, you have to think about how ordering constraints affect the way you scale.
If you have multiple consumers, they may operate at different speeds and you can never be sure which consumer goes before the other. The only way to guarantee order is to have a single consumer (you can still achieve high availability by using exclusive consumers).
You can, however, segment the load in other ways. How depends a lot on your application. If you can create, say, 16 "worker" queues (or whatever your max consumer count would be) and distribute load to these queues while guaranteeing that requests from a single user always go to the same worker queue, message order will remain per user.
If you have no good way to divide users into groups, take the user ID mod MAX_CONSUMER_THREADS as a simple solution, as in the sketch below.
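A minimal sketch of that routing idea on the producer side, assuming the classic ActiveMQ 5.x JMS client; the broker URL, the queue naming scheme, the WORKER_QUEUE_COUNT constant, and the choice of the record ID as the routing key are all illustrative, not part of any ActiveMQ API:

```java
import javax.jms.Connection;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;
import org.apache.activemq.ActiveMQConnectionFactory;

public class RecordRouter {
    // Illustrative constant: one queue per single-threaded consumer.
    private static final int WORKER_QUEUE_COUNT = 16;

    public static void send(long recordId, String payload) throws Exception {
        ActiveMQConnectionFactory factory =
                new ActiveMQConnectionFactory("tcp://localhost:61616");
        Connection connection = factory.createConnection();
        connection.start();
        try {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            // The same record ID always hashes to the same worker queue, so
            // updates to one record are handled in order by a single consumer.
            int bucket = (int) (recordId % WORKER_QUEUE_COUNT);
            Queue queue = session.createQueue("records.worker." + bucket);
            MessageProducer producer = session.createProducer(queue);
            TextMessage message = session.createTextMessage(payload);
            message.setLongProperty("recordId", recordId);
            producer.send(message);
        } finally {
            connection.close();
        }
    }
}
```

Each of the 16 single-threaded consumers would then read from exactly one records.worker.N queue, so processing stays ordered per record while the system as a whole still uses 16 threads.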
There may be better ways to deal with this problem in the consumer logic itself, such as keeping track of a sequence number and postponing updates that arrive out of order (a scheduled delay can be used for that).

Related

Redis streams - free stuck messages in a consumer group without claiming

Let's say there are messages in a Redis consumer group that have not been processed for N seconds. I am trying to understand if it's possible to free them and put them back for other members of the consumer group to see. I don't want to claim/process these stuck messages. I just want to make them accessible to other active members of the consumer group. Is this possible?
From what I have understood from the documentation, the options are XAUTOCLAIM or a combination of XPENDING and XCLAIM, and neither of these meets my requirements.
Essentially, I am trying to create a standalone process that can act as a monitor and make those messages visible to active consumers in the consumer group, and I am planning to use this standalone process to perform the same activity for multiple consumer groups (around 30). So I don't want this standalone process to take any other actions.
Please suggest how this can be designed.
Thanks!
Pending messages are removed from Redis' PEL only when they are acknowledged: this is by design, and it allows the message re-distribution process to scale out to each individual consumer and avoids the single point of failure of having one monitoring process like the one you described.
So, in short, what you are looking for can't be done, and I would suggest using XAUTOCLAIM or XPENDING / XCLAIM in your consumer processes instead.
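For illustration, here is a rough sketch of that consumer-side reclaim step using the Jedis client; the stream, group, and consumer names are placeholders, and the exact xautoclaim signature and result types differ between Jedis versions, so treat this as an outline rather than a drop-in implementation:

```java
import java.util.List;
import java.util.Map;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.StreamEntryID;
import redis.clients.jedis.params.XAutoClaimParams;
import redis.clients.jedis.resps.StreamEntry;

public class StuckMessageReclaimer {

    // Illustrative names; adjust to your stream/group layout.
    private static final String STREAM = "events";
    private static final String GROUP = "workers";
    private static final long MIN_IDLE_MS = 30_000; // consider a message "stuck" after 30s

    public static void reclaimAndProcess(Jedis jedis, String consumerName) {
        // XAUTOCLAIM transfers ownership of pending entries that have been idle
        // longer than MIN_IDLE_MS to the calling consumer, scanning from 0-0.
        Map.Entry<StreamEntryID, List<StreamEntry>> result = jedis.xautoclaim(
                STREAM, GROUP, consumerName, MIN_IDLE_MS,
                new StreamEntryID("0-0"), new XAutoClaimParams().count(10));

        for (StreamEntry entry : result.getValue()) {
            // Process the reclaimed entry, then acknowledge it so it leaves the PEL.
            System.out.println("reclaimed " + entry.getID() + " -> " + entry.getFields());
            jedis.xack(STREAM, GROUP, entry.getID());
        }
    }
}
```

The point is that the consumer doing the reclaiming also processes and acknowledges the entries; there is no command that simply returns an unacknowledged pending entry to the pool without a new owner.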

Kafka consumer synchronization behavior

I am currently exploring Kafka as a beginner for a simple problem.
There will be one producer pushing messages to one topic, but there will be n consumers (Spark applications) massaging the data from Kafka and inserting it into a database (each consumer inserts into a different table).
Is there a possibility that the consumers will go out of sync (for example, one of the consumers goes down for quite some time), so that one or more consumers does not process a message and insert it into its table?
Assume the code is always correct and no exception will arise when massaging the data. It is important that every message is processed exactly once.
My question is: does Kafka handle this for us, or do we have to write additional code to make sure this does not happen?
You can group consumers (see the group.id config), and grouped consumers split the topic's partitions among themselves. Once a consumer drops out, other consumers from the group take over the partitions that were read by the dropped one.
However, there may be some problems: when a consumer reads a partition it commits offsets back to Kafka, and if a consumer drops after it has processed the received data but before it has committed the offset, other consumers will start reading from the last committed offset, so some data may be processed more than once. Fortunately, you can manage the strategy of how offsets are committed (see the consumer settings enable.auto.commit, auto.offset.reset, etc.).
The Kafka and Spark Streaming guides provide some explanations and possible strategies for managing offsets.
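As a concrete illustration of that commit strategy, here is a minimal sketch of a plain Java consumer that disables auto-commit and only commits after its database write has succeeded; the topic name, group id, and insertIntoTable helper are placeholders, not anything from the question:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class TableWriterConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Each application that writes to its own table uses its own group.id,
        // so every application sees every message.
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "table-a-writer");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        // Disable auto-commit so an offset is only committed after the insert succeeds.
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    insertIntoTable(record.value()); // your database write
                }
                // Commit only after the whole batch has been written.
                consumer.commitSync();
            }
        }
    }

    private static void insertIntoTable(String value) {
        // placeholder for the actual JDBC / Spark write
    }
}
```

With one group id per table-writing application, every application still receives every message; within one application, a crash before commitSync() simply means the uncommitted batch is read and written again on restart, which is why the write itself should be idempotent if exactly-once behaviour matters.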
By design Kafka decouples producers and consumers. Consumers will read as fast as they can, and producers can produce as fast as they can.
Consumers can be organized into "consumer groups": you can set things up so that multiple consumers read from a single group, or so that an individual consumer reads from its own group.
If you have one consumer per group, then (depending on your acknowledgement strategy) you should be able to ensure each message is read only once (per consumer).
Otherwise, if you have multiple consumers reading from a single group, the same applies, but each message is read once by one of the n consumers.

Use of numWorkers in firebase-queue

I am using firebase-queue in a mobile app to handle some server-side work. In the firebase-queue documentation here, it says that we can specify an optional parameter numWorkers, which specifies the number of workers that can run simultaneously for the Node.js thread. I don't fully understand how to use this parameter in my application. For example, one of the things I am doing on the server side using firebase-queue is sending a verification code to the user when they first log into the application. This could be hundreds of users in the future. I have a few questions that I wanted to clarify to understand the use of numWorkers a little better:
When should I have more than one worker for a firebase queue?
What is the optimum number of workers for any firebase queue? Coming from a Java background, it is commonly said that having more and more threads running in an application can start to become an overhead after a certain limit. I am not sure if similar principles apply here.
If I have more than one queue serving different specIds, do I need to think about the total number of workers at a cumulative level rather than per queue? I have four queues at the moment.
Please let me know if you have any information regarding my questions above. Any input is appreciated.
Update - June 5, 2016
After playing around with firebase-queue some more, I have realized that numWorkers controls how many tasks of a given spec can run simultaneously. Since the queue worker does not process tasks asynchronously, if tasks of a given specId take a long time to finish, you may end up with many tasks in the queue waiting to be picked up. For example, if there is a network call in the processing of a task, it may take longer to finish, and if you expect a lot of these tasks to be on the queue, you should have more than one worker in the firebase queue. So I now know the answer to my first question.
I am still wondering about questions 2 and 3. I have some tasks in the queue which could number in the hundreds or thousands at a given time, and some of them involve a network call, so they may take a considerable amount of time to finish. I am not sure of the repercussions of having, say, a hundred workers for a queue. I am not able to test it myself since my app is still in development and I don't have a setup to simulate a large number of such tasks at the moment.

Kafka - Dynamic / Arbitrary Partitioning

I'm in the process of building a consumer service for a Kafka topic. Each message contains a url to which my service will make an http request. Each message / url is completely independent from other messages / urls.
The problem I'm worried about is how to handle long-running requests. It's possible for some http requests to take 50+ minutes before a response is returned. During that time, I do not want to hold up any other messages.
What is the best way to parallelize this operation?
I know that Kafka's approach to parallelism is to create partitions. However, from what I've read, it seems that you need to define the number of partitions up front, whereas I really want an infinite or dynamic number of partitions (ideally each message gets its own partition, created on the fly).
As an example, let's say I create 1,000 partitions. If 1,001+ messages are produced to my topic, the first 1,000 requests will be made but every message after that will be queued up until the previous request in that partition finishes.
I've thought about making the http requests asynchronous but then I seem to run into a problem when determining what offset to commit.
For instance, on a single partition I can have a consumer read the first message and make an async request. It provides a callback function which commits that offset to Kafka. While that request is waiting, my consumer reads the next message and makes another async request. If that request finishes before the first it will commit that offset. Now, what happens if the first request fails for some reason or my consumer process dies? If I've already committed a higher offset, it sounds like this means my first message will never get reprocessed, which is not what I want.
I'm clearly missing something when it comes to long-running, asynchronous message processing using Kafka. Has anyone experienced a similar issue or have thoughts on how to best solve this? Thanks in advance for taking the time to read this.
You should look at Apache Storm for the processing portion of your consumer and leave the message storage and retrieval to Kafka. What you've described is a very common use case in Big Data (although the 50+ minute thing is a bit extreme). In short, you'll have a small number of partitions for your topic and let Storm stream processing scale the number of components ("bolts" in Storm-speak) that actually make the HTTP requests. A single spout (the kind of Storm component that reads data from an external source) could read the messages from the Kafka topic and stream them to the processing bolts.
I've posted an open source example of how to write a Storm/Kafka application on github.
Some follow-on thoughts to this answer:
1) While I think Storm is the correct platform approach to take, there's no reason you couldn't roll your own by writing a Runnable that performs the HTTP call and then writing some more code to make a single Kafka consumer read messages and process them with multi-threaded instances of your Runnable. The management code required is a bit involved, but probably easier to write than what it takes to learn Storm from scratch. So you'd scale by adding more instances of the Runnable on more threads.
2) Whether you use Storm or your own multi-threaded solution, you'll still have the problem of how to manage the offset in Kafka. The short answer there is that you'll have to do your own complex offset management. Not only will you have to persist the offset of the last message you read from Kafka, but you'll have to persist and manage the list of in-flight messages currently being processed. In this way, if your app goes down, you know what messages were being processed and you can retrieve and re-process them when you start back up. The base Kafka offset persistence doesn't support this more complex need, but it's only there as a convenience for the simpler use cases anyway. You can persist your offset info anywhere you like (ZooKeeper, the file system or any database).
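As a deliberately simplified sketch of option 1, the snippet below has a single Kafka consumer hand each message to a thread pool and track in-flight offsets in memory; the topic name, group id, pool size, and fetch helper are made up, and a real version would persist both the in-flight set and its own offsets (to ZooKeeper, a file, or a database) as described above:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class UrlFetcher {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "url-fetcher");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        // Offsets are intentionally not committed here; a real implementation would
        // persist its own offset and in-flight state as described in the answer.
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        ExecutorService pool = Executors.newFixedThreadPool(32);
        // In-flight tracking: offsets handed to worker threads but not yet finished.
        Set<Long> inFlight = ConcurrentHashMap.newKeySet();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("urls"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    inFlight.add(record.offset());
                    pool.submit(() -> {
                        try {
                            fetch(record.value());   // the long-running HTTP call
                        } finally {
                            inFlight.remove(record.offset());
                        }
                    });
                }
            }
        }
    }

    private static void fetch(String url) {
        // placeholder for the actual HTTP request
    }
}
```

On restart you would reload the persisted in-flight offsets, seek back to them, and re-process those messages before resuming normal consumption; that is the "complex offset management" part that the plain Kafka commit mechanism does not cover.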

NServiceBus: when are too many messages handled by one service?

When considering a service in NServiceBus, at what point do you start questioning whether the number of messages handled by the service is too many, and start to break these out into a new service?
Consider the following: I have a sales service which can currently be broken into a few distinct business components: sales order validation, sales order processing, purchase order validation and purchase order processing.
There are currently about 20 message handlers and 2 sagas used within this service. My concern is that during high-volume traffic from my website, the number of messages can spike into the hundreds. Considering that the messages need to be processed in the order they are taken off the queue, this can cause a delay for the last one in the queue (depending on what processing each message does).
When separating concerns within a service into smaller business components, I find this makes things a little easier. Sure, it's a logical separation, but it seems to provide a layer of clarity and understanding. To me it seems an easier option to do this than to create new services, since in the end the more services I have, the more maintenance I need to do.
Does anyone have any similar concerns to this?
I think you have actually answered your own question :)
As soon as the message volume reaches a point where the lag becomes an issue, you could look at running more instances of your endpoint. You do not necessarily need to reduce the number of handlers. You could simply install the service a number of times and have specific message types sent to the relevant endpoint by mapping.
So it becomes a matter of a simple instance installation and some config changes. You can then split messages on sending either by source, so that messages from a particular source end up on a particular endpoint (for priority, perhaps), or by message type.
I happened to do the same thing on a previous project (not using NServiceBus, though) where we needed document conversion messages coming from the UI to be processed ASAP. We simply installed the conversion service again with its own set of queues and changed the UI configuration to send the conversion messages to the new endpoint. The background conversion messages were still going to the previous endpoint. So here the source determined the separation.
