Seek to an offset via an external trigger - spring-kafka

Currently I use an AcknowledgingMessageListener to implement a Kafka consumer with Spring Kafka. This implementation lets me listen on a specific topic and process messages with a manual ack.
I now need to build the following capability:
Let us assume that, because of some environmental exception or some bad data arriving on this topic, I need to replay data on a topic from and to a specific offset. This would be a manual trigger (most likely via the execution of a Java class).
It would be ideal if I could retrieve the messages between those offsets and feed them into a replay topic, so that a new consumer can process those messages while keeping the offsets intact on the original topic.
1. ConsumerSeekAware interface: if this is the answer, how can I trigger this externally, say via mvn -Dexec? I am not sure whether this is even possible.
2. Say I have a crash timestamp: is it possible to introspect the topic to find the offset corresponding to the crash, so that I can replay from that offset?
3. Can I find the offsets corresponding to some specific data, so that I can replay those specific offsets?
All of these requirements are towards building a resilience layer around our Kafka capabilities. I need all of this to be managed by a separate executable class that can be triggered manually with the relevant data (like timestamps, etc.). This class should determine the offsets, seek to them, retrieve the corresponding messages, and post them to a separate topic. Can someone please point me in the right direction? I'm afraid I'm going around in circles.

so that a new consumer can process those messages thus keeping the offsets intact on the original topic.
Just create a new listener container with a different group id (new consumer) and use a ConsumerAwareRebalanceListener (or ConsumerSeekAware) to perform the seeks when the partitions are assigned.
Here is a sample CARL (ConsumerAwareRebalanceListener) that seeks all assigned partitions based on a timestamp.
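(The original sample isn't reproduced here; below is a minimal sketch of what such a listener could look like, assuming Spring Kafka's ConsumerAwareRebalanceListener and the kafka-clients offsetsForTimes API. The class name and the way the timestamp is supplied are invented for illustration.)

import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;
import org.springframework.kafka.listener.ConsumerAwareRebalanceListener;

public class SeekToTimestampRebalanceListener implements ConsumerAwareRebalanceListener {

    private final long replayFromTimestamp; // epoch millis, e.g. the crash time

    public SeekToTimestampRebalanceListener(long replayFromTimestamp) {
        this.replayFromTimestamp = replayFromTimestamp;
    }

    @Override
    public void onPartitionsAssigned(Consumer<?, ?> consumer, Collection<TopicPartition> partitions) {
        // Ask the broker which offset corresponds to the timestamp on each partition...
        Map<TopicPartition, Long> query = new HashMap<>();
        partitions.forEach(tp -> query.put(tp, replayFromTimestamp));
        Map<TopicPartition, OffsetAndTimestamp> offsets = consumer.offsetsForTimes(query);
        // ...and seek there; a null entry means no message at or after the timestamp.
        offsets.forEach((tp, oat) -> {
            if (oat != null) {
                consumer.seek(tp, oat.offset());
            }
        });
    }
}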
You will need some mechanism to know when the new consumer should stop consuming (at which time you can stop() the new container). Maybe set max.poll.records=1 on the new consumer so it doesn't prefetch past the failure point.
I am not sure what you mean by #3.
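For the "separate executable class" part of the question, the timestamp-to-offset lookup and the copy to a replay topic can also be done with the plain kafka-clients API, outside any listener container. The following is a rough, single-partition sketch, not a definitive implementation; the topic names, bootstrap servers, and end-offset handling are placeholder simplifications:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class ReplayTool {

    public static void main(String[] args) {
        long fromTimestamp = Long.parseLong(args[0]); // epoch millis, e.g. the crash time

        Properties cProps = new Properties();
        cProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        cProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        cProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

        Properties pProps = new Properties();
        pProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        pProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        pProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

        TopicPartition tp = new TopicPartition("original-topic", 0); // one partition for brevity
        try (Consumer<String, String> consumer = new KafkaConsumer<>(cProps);
             Producer<String, String> producer = new KafkaProducer<>(pProps)) {
            consumer.assign(Collections.singletonList(tp)); // no group management needed
            // Translate the timestamp into a starting offset, and capture the current end.
            OffsetAndTimestamp start = consumer
                    .offsetsForTimes(Collections.singletonMap(tp, fromTimestamp)).get(tp);
            long end = consumer.endOffsets(Collections.singletonList(tp)).get(tp);
            if (start == null) {
                return; // nothing at or after that timestamp
            }
            consumer.seek(tp, start.offset());
            long position = start.offset();
            // Copy everything up to the captured end offset into the replay topic.
            while (position < end) {
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(1))) {
                    if (rec.offset() >= end) {
                        break;
                    }
                    producer.send(new ProducerRecord<>("replay-topic", rec.key(), rec.value()));
                    position = rec.offset() + 1;
                }
            }
        }
    }
}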

Related

How do I guarantee task order processing for a queue with multiple consumers in RabbitMQ?

Say I want to start friendship between A and B.
Say I want to end friendship between A and B.
Those are two tasks I want to send to a queue having multiple consumers (workers).
I want to guarantee the processing order, so how do I prevent the second task from being performed before the first?
My solution: make tasks sticky (tasks about A are always sent to the same consumer).
Implementation: use RabbitMQ's exchanges and map tasks to the available consumers.
How do I map A to its consumer? I'm thinking about nginx's ip_hash. I think I need something similar.
I don't know if it is relevant but A and B are uuid.v4() UUIDs.
Can you point me to the algorithm I need to accomplish this mapping, please?
Well, there are two options:
Make one exchange/queue for all events and guarantee that they are inserted in the proper order. Create one worker for them. This costs more when inserting data (and doesn't give you the option of scaling out).
Prepare your app for such situations. For example, when you get a destroyFriendship message and the friendship does not exist, save a row to the db recording the future friendship ending. Then you can have multiple workers making and destroying friendships and do not have to care about the proper order. Simply do your job: make the friends, and if there is a row in the db about ending that friendship, destroy it (or simply do not create it). Of course you need to compare the creation and destruction timestamps and check that the destruction time came after the creation time!
Of course you can compute some hash of A/B, but IMO it would be more costly than preparing the app. Scaling an app using exchanges/queues is not really good: you are going to create more and more queues, and it will end up with too many queues/exchanges in RabbitMQ.
If you have to use the solution you specified, you can, for example, compute a crc32 of A and B and use its value to calculate which queue the task should be sent to, as sketched below. But having multiple consumers might still go wrong here: what if one of the consumers is blocked somehow and another receives the message destroying the friendship? With this solution I'd say it's dangerous to have more than one worker per group of A/B.
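A minimal sketch of that hashing idea (in Java; the queue-name scheme and queue count are assumptions): compute a CRC32 of the entity's UUID and take it modulo the number of queues, so all tasks about the same entity always land on the same queue and therefore the same consumer.

import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class StickyRouter {

    private final int queueCount;

    public StickyRouter(int queueCount) {
        this.queueCount = queueCount;
    }

    // Returns a stable routing key such as "tasks.3" for a given entity UUID,
    // so every task about that entity ends up in the same queue.
    public String routingKeyFor(String entityUuid) {
        CRC32 crc = new CRC32();
        crc.update(entityUuid.getBytes(StandardCharsets.UTF_8));
        return "tasks." + (crc.getValue() % queueCount);
    }
}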

Storm best topology design patterns for I/O tasks

So, I have a framework where the input is a Kafka queue of Tweet documents. My topology needs to read it and hit three different external APIs.
I need a way to make sure all three are done before moving forward. I don't think a BatchBolt is a good solution, is it? Can anybody help with this?
Edit / Clarification
The three API hits need to return the results. I would need to process these responses before the document is passed over to the next bolt.
If I understand you right, you have a Kafka spout emitting messages, and each message needs to be delivered to three different systems. This could easily be done by configuring the topology:
TopologyBuilder topologyBuilder = new TopologyBuilder();
topologyBuilder.setSpout("Generator", spout, 1);
// Each bolt subscribes to the spout's stream, so every tuple is delivered to all three.
topologyBuilder.setBolt("External1", bolt1, 1).localOrShuffleGrouping("Generator");
topologyBuilder.setBolt("External2", bolt2, 1).localOrShuffleGrouping("Generator");
topologyBuilder.setBolt("External3", bolt3, 1).localOrShuffleGrouping("Generator");
In this simple example, three different bolts are connected to one Kafka spout. When the spout emits a message, it goes to all three bolts at the same time. If one of the bolts doesn't acknowledge a tuple, the tuple will fail and be re-emitted to all of these bolts.
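The edit/clarification above says the three API responses must be processed together before the document moves on, which the fan-out alone doesn't give you. One common approach, a sketch only and not part of the original answer, is a fourth bolt that groups by tweet id and emits once all three responses have arrived. It assumes Storm 2.x APIs and that each external bolt emits tuples with "tweetId" and "response" fields:

import java.util.HashMap;
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Wire it after the three external bolts, grouping by tweet id so all three
// responses for one tweet land on the same task:
//   topologyBuilder.setBolt("Join", new JoinBolt(), 1)
//       .fieldsGrouping("External1", new Fields("tweetId"))
//       .fieldsGrouping("External2", new Fields("tweetId"))
//       .fieldsGrouping("External3", new Fields("tweetId"));
public class JoinBolt extends BaseRichBolt {

    private OutputCollector collector;
    private Map<String, Map<String, Object>> pending; // tweetId -> source bolt -> response

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext ctx, OutputCollector collector) {
        this.collector = collector;
        this.pending = new HashMap<>();
    }

    @Override
    public void execute(Tuple input) {
        String tweetId = input.getStringByField("tweetId");
        Map<String, Object> responses = pending.computeIfAbsent(tweetId, k -> new HashMap<>());
        responses.put(input.getSourceComponent(), input.getValueByField("response"));
        if (responses.size() == 3) { // all three external bolts have answered
            // (unanchored emit for brevity; anchor to the contributing tuples for full reliability)
            collector.emit(new Values(tweetId, pending.remove(tweetId)));
        }
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("tweetId", "responses"));
    }
}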

A MailboxProcessor that operates with a LIFO logic

I am learning about F# agents (MailboxProcessor).
I am dealing with a rather unconventional problem.
I have one agent (dataSource) which is a source of streaming data. The data has to be processed by an array of agents (dataProcessor). We can consider dataProcessor as some sort of tracking device.
Data may flow in faster than the dataProcessor is able to process it.
It is OK to have some delay. However, I have to ensure that the agent stays on top of its work and does not get buried under obsolete observations.
I am exploring ways to deal with this problem.
The first idea is to implement a stack (LIFO) in dataSource: dataSource would send over the latest observation available when dataProcessor becomes available to receive and process the data. This solution may work, but it may get complicated, as dataProcessor would need to be blocked and re-activated and to communicate its status to dataSource, leading to a two-way communication problem. This may boil down to a blocking queue as in the producer-consumer problem, but I am not sure.
The second idea is to have dataProcessor take care of message sorting. In this architecture, dataSource would simply post updates to dataProcessor's queue. dataProcessor would use Scan to fetch the latest data available in its queue. This may be the way to go. However, I am not sure whether, in the current design of MailboxProcessor, it is possible to clear a queue of messages, deleting the older, obsolete ones. Furthermore, here, it is written that:
Unfortunately, the TryScan function in the current version of F# is broken in two ways. Firstly, the whole point is to specify a timeout but the implementation does not actually honor it. Specifically, irrelevant messages reset the timer. Secondly, as with the other Scan function, the message queue is examined under a lock that prevents any other threads from posting for the duration of the scan, which can be an arbitrarily long time. Consequently, the TryScan function itself tends to lock up concurrent systems and can even introduce deadlocks because the caller's code is evaluated inside the lock (e.g. posting from the function argument to Scan or TryScan can deadlock the agent when the code under the lock blocks waiting to acquire the lock it is already under).
Having the latest observation bounced back may be a problem.
The author of this post, Jon Harrop, suggests that:
I managed to architect around it and the resulting architecture was actually better. In essence, I eagerly Receive all messages and filter using my own local queue.
This idea is surely worth exploring but, before starting to play around with code, I would welcome some inputs on how I could structure my solution.
Thank you.
Sounds like you might need a destructive-scan version of the mailbox processor. I implemented this with TPL Dataflow in a blog series that you might be interested in.
My blog is currently down for maintenance but I can point you to the posts in markdown format.
Part1
Part2
Part3
You can also check out the code on github
I also wrote about the issues with scan in my lurking horror post
Hope that helps...
tl;dr I would try this: take the Mailbox implementation from FSharp.Actor or Zach Bray's blog post, replace the ConcurrentQueue with a ConcurrentStack (plus add some bounded-capacity logic), and use this changed agent as a dispatcher to pass messages from dataSource to an army of dataProcessors implemented as ordinary MBPs or Actors.
tl;dr2 If workers are a scarce and slow resource and we need to process the message that is the latest at the moment a worker becomes ready, then it all boils down to an agent with a stack instead of a queue (with some bounded-capacity logic) plus a BlockingQueue of workers. The dispatcher dequeues a ready worker, then pops a message from the stack and sends this message to the worker. After the job is done, the worker enqueues itself back to the queue when it becomes ready (e.g. before let! msg = inbox.Receive()). The dispatcher's consumer thread then blocks until any worker is ready, while the producer thread keeps the bounded stack updated. (A bounded stack could be done with an array + offset + size inside a lock; the one below is more complex than necessary.)
Details
MailboxProcessor is designed to have only one consumer. This is even commented in the source code of the MBP here (search for the word 'DRAGONS' :) ).
If you post your data to an MBP then only one thread can take it from the internal queue or stack.
In your particular use case I would use a ConcurrentStack directly, or better, wrapped in a BlockingCollection:
It will allow many concurrent consumers
It is very fast and thread safe
BlockingCollection has a BoundedCapacity property that allows you to limit the size of the collection. It throws on Add, but you could catch the exception or use TryAdd. If A is the main stack and B is a standby, then TryAdd to A; on false, Add to B and swap the two with Interlocked.Exchange, then process the needed messages in A, clear it, and make a new standby. Use three stacks if processing A could take longer than B takes to fill up again. This way you do not block and do not lose any messages, but can discard unneeded ones in a controlled way.
BlockingCollection has methods like AddToAny/TakeFromAny, which work on arrays of BlockingCollections. This could help, e.g.:
dataSource produces messages to a BlockingCollection with ConcurrentStack implementation (BCCS)
another thread consumes messages from the BCCS and sends them to an array of processing BCCSs. You said that there is a lot of data, so you may sacrifice one thread to block and dispatch your messages indefinitely
each processing agent has its own BCCS, or is implemented as an Agent/Actor/MBP to which the dispatcher posts messages. In your case you need to send a message to only one processorAgent, so you may store the processing agents in a circular buffer to always dispatch a message to the least recently used processor.
Something like this:
 (data stream produces 'T)
             |
   [dispatcher's BCCS]
             |
 (a dispatcher thread consumes 'T and pushes to processors,
  manages capacity of BCCS and LRU queue)
      |                                 |
[processor1's BCCS/Actor/MBP] ... [processorN's BCCS/Actor/MBP]
      |                                 |
  (process)                         (process)
Instead of ConcurrentStack, you may want to read about the heap data structure. If you need your latest messages by some property of the messages, e.g. a timestamp, rather than by the order in which they arrive on the stack (e.g. if there could be delays in transit, so arrival order <> creation order), you can get the latest message by using a heap, as sketched below.
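A small illustration of the heap idea (in Java for brevity, though the thread is about .NET/F#; the Observation type is invented): a max-heap keyed on the message's own timestamp, so popping always yields the newest observation regardless of arrival order. Note this sketch is not thread-safe; the dispatcher would need to guard it with a lock or a concurrent wrapper.

import java.util.Comparator;
import java.util.PriorityQueue;

public class NewestFirstBuffer {

    public record Observation(long timestamp, String payload) {}

    // Max-heap on the message's own timestamp, not on arrival order.
    private final PriorityQueue<Observation> heap =
            new PriorityQueue<>(Comparator.comparingLong(Observation::timestamp).reversed());

    public void push(Observation o) {
        heap.add(o);
    }

    // Returns and removes the newest observation, or null if empty.
    public Observation popNewest() {
        return heap.poll();
    }
}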
If you still need Agents semantics/API, you could read several sources in addition to Dave's links, and somehow adapt the implementation to multiple concurrent consumers:
An interesting article by Zach Bray on an efficient Actors implementation. There you do need to replace the line execute true (under the comment // Might want to schedule this call on another thread.) with a line async { execute true } |> Async.Start or similar, because otherwise the producing thread will also be the consuming thread, which is not good for a single fast producer. However, for a dispatcher like the one described above, this is exactly what is needed.
The FSharp.Actor (aka Fakka) development branch and the F# MailboxProcessor source code (first link above) could be very useful for implementation details. The FSharp.Actor library has been frozen for several months, but there is some activity in the dev branch.
You should also not miss the discussion about Fakka on Google Groups in this context.
I have a somewhat similar use case, and for the last two days I have researched everything I could find on F# Agents/Actors. This answer is a kind of TODO for myself to try these ideas, half of which were born while writing it.
The simplest solution is to greedily eat all messages in the inbox when one arrives and discard all but the most recent. Easily done using TryReceive:
let rec readLatestLoop oldMsg =
    async { let! newMsg = inbox.TryReceive 0
            match newMsg with
            | None -> return oldMsg
            | Some newMsg -> return! readLatestLoop newMsg }
let readLatest() =
    async { let! msg = inbox.Receive()
            return! readLatestLoop msg }
When faced with the same problem I architected a more sophisticated and efficient solution I called cancellable streaming, described in an F# Journal article here. The idea is to start processing messages and then cancel that processing if they are superseded. This significantly improves concurrency if significant processing is being done.

Starting mutliple orchestrations from parent orchestration and passing messages to them

I have a situation where a main orchestration is responsible for processing a convoy of messages. These messages belong to a set of customers; the orchestration will read the messages as they come in, and for each new customer ID it finds, it will spin up a new orchestration responsible for processing the messages of that particular customer. I have to preserve the order of messages as they come in, so the newly created orchestrations should process the message they have and wait for additional messages from the main orchestration.
I have tried different ways to tackle this, but was not able to implement it successfully.
I would like to hear your opinions on how this could be done.
Thanks.
It sounds like what you want is a set of nested convoys. While it might be possible to get that working, it's going to... well, hurt. In particular, my first worry would be maintenance: any changes to the process would be a pain in the neck to make, and, much worse, deployment would really, really suck.
Personally, I would really try to find an alternative way to implement this and avoid the convoys if possible, but that would depend a lot on your specific scenario.
A few questions, if you don't mind:
What are your ordering requirements? For example, do you only need ordered processing for each customer on a single incoming batch, or across batches? If the latter, could you make do without the master orchestration and just force a single convoy'd instance per customer? Still not great, but would likely simplify things a lot.
What are your failure requirements with respect to ordering? Should a failure completely stop processing? Save the message and keep going? What about retries?
Is ordering based purely on the arrival time of the message? Is there anything in the message that you could use to force ordering internally instead of relying purely on the arrival time?
What does the processing of the individual messages do? Is the ordering requirement only there to ensure that certain preconditions are met when a specific message is processed (for example, messages represent some tree structure that requires parents to be processed before children)?
I don't think you need a master orchestration to start up the sub-orchestrations. I am assuming you are not talking about the master orchestration implementing a convoy pattern. So, if that's the case, here's what I might do.
There is a brief example here on how to implement a singleton orchestration. This example shows you how to set up an orchestration that will only ever exist once. All the messages going to it will be lined up in order of receipt and processed one at a time. Your example differs in that you want to have this done per customer ID. This is pretty simple: promote the customer ID in the inbound message and add it to the correlation type. Now there will only ever be one instance of the orchestration per customer.
The problem with singletons is this: you have to kill them at some point or they will live forever as dehydrated orchestrations. So, you need to have them end. You can do this if there is a way for the last message for a given customer to signal the orchestration that it's time to die, through an attribute or such. If this is not possible, then you need to set a timer: if no messages are received in x seconds, terminate the orchestration. This is all easy to do, but it can introduce zombies. Zombies occur when an orchestration is in the process of being shut down just as another message for that customer comes in. This can usually be solved by tweaking the time to wait. Regardless, it will cause the occasional zombie.
A note from the field: we've done this, and it's really not a great long-term solution. We were receiving customer info updates and had to ensure ordered processing. We took this singleton approach, and it has been problematic because of the zombie issue and the exception issue: if the singleton orchestration throws an exception, it will block the processing of all future messages for that customer. So, handle every single possible exception. The real solution would have been to have the far-end system check the timestamps on the update messages and discard ones that were older than the last update. We wanted to go this way, but the receiving system didn't want to do the extra work.

Biztalk Ordered Delivery failure

We have a BizTalk application where the order of the messages being input is very important and has to be kept, meaning they have to be output in the same order. Normally, ordered delivery would do the trick here.
However, I read that ordered delivery is only guaranteed when you connect a receive location directly to a send port. The moment you use orchestrations, ordered delivery isn't guaranteed anymore. Is there a way to work around or fix this? Because this kind of ruins our whole application, and we've been working on it for months.
I read a workaround from Microsoft where they use an extra field with a counter and an end orchestration which checks the counters, but that is far too much work for us to do now, so that workaround is a no-go. Besides, not all messages are translated, which creates holes in our flow, and not all messages come from the same source either, which makes this workaround useless anyway.
Any other ideas?
Check out this page.
It explains that if you have an orchestration that follows the singleton pattern to ensure only one instance of the orchestration exists, and you make sure you set the orchestration's receive port to ordered delivery, then you should get a valid end-to-end ordered delivery scenario:
To provide end-to-end ordered delivery the following conditions must be met:
Messages must be received with an adapter that preserves the order of the messages when submitting them to BizTalk Server. In BizTalk Server 2006, examples of such adapters are MSMQ, MQSeries, and MSMQT. In addition, HTTP or SOAP adapters can be used to submit messages in order, but in that case the HTTP or SOAP client needs to enforce the order by submitting messages one at a time.
You must subscribe to these messages with a send port that has the Ordered Delivery option set to True.
If an orchestration is used to process the messages, only a single instance of the orchestration should be used, the orchestration should be configured to use a sequential convoy, and the Ordered Delivery property of the orchestration's receive port should be set to True.
Resequencing strategies for ordered delivery in BizTalk:
I recently responded to a LinkedIn user's question regarding ordered delivery options in BizTalk.
I thought it would be useful for people to understand some of the strategies for re-sequencing messages using BizTalk.
Often, as a BizTalk developer, you are required to integrate with line-of-business systems which are unchangeable. This can be for one or more of many different reasons. As an example, the cost of changing a system can be too high, or the vendor's license may state that support will be withdrawn if the system is changed.
This would not normally represent a problem where the vendor has provided a thoughtfully designed API as a point-of-integration endpoint. However, as many Integration Developers quickly learn, this is very rarely the case.
What do I mean by a thoughtfully designed API? Well, aside from all the SODA principles (service composition, fault contracts, etc.), the most important feature of an API is to support the consumption of data which arrives in the wrong order.
This is a fairly simple thing to do. For example, if you are a vendor and you provide a HTTP operation as your integration point then one of the fields you could expose on your operation is a time-stamp or, even better, a sequence number. This means that if a call is made with an out-of-date payload, the relevant compensating mechanism can kick-in - which can be as simple as discarding the data.
This article discusses what to do when the vendor has not built this feature into an API, and as an integrator you therefore are forced to implement end-to-end ordered delivery as part of your integration solution.
As stated in my response to the user's post on LinkedIn (see link above), in BizTalk ordered delivery in any but the simplest of scenarios is complicated at best and at worst can represent a huge cost in increased complexity, both in terms of development and support. The basic reason is that BizTalk is designed to be massively concurrent to enable high throughput, and there is a direct and unavoidable conflict between concurrency and ordering. Shoe-horning E2E ordered delivery into a BizTalk solution relies on artefacts such as singleton orchestrations which introduce complexity and increase both failure rate and cost-per-failure numbers.
A far better solution is to maintain concurrent processing to as near as possible to the line-of-business system endpoints, and then implement what is called a re-sequencer wrapper around each of the endpoints which require data to be delivered in the correct order.
How to implement such a wrapper in BizTalk depends on some factors, which are outlined in the following table:
|Sequencing field|Messages are deltas?|Database integration?|Wrapper strategy                  |
|----------------|--------------------|---------------------|----------------------------------|
|n of a total m  |N                   |Y                    |Stored procedure                  |
|n of a total m  |N                   |N                    |Singleton orchestration           |
|n of a total m  |Y                   |Y                    |Batched singleton orchestration   |
|n of a total m  |Y                   |N                    |Batched singleton orchestration   |
|Timestamp       |N                   |Y                    |Stored procedure                  |
|Timestamp       |N                   |N                    |Singleton orchestration           |
|Timestamp       |Y                   |Y                    |Buffer table with staggered reader|
|Timestamp       |Y                   |N                    |Buffer table with staggered reader|
The first factor, Sequencing field, relates to the idea that in order to implement any kind of re-sequencer wrapper, you will at a minimum require that your message data contains some sequencing information. This can take the form of a source time-stamp; however a better, though rarer, kind of sequencing information consists of a sequence number combined with the total number of messages, for example, message 1 of 10, 2 of 10, etc.
The second factor Messages are deltas? relates to whether or not the payload of your message contains a single state change to the data or the sum of all past changes to the data. Put another way, is it possible to reconstruct the full current state of the data from this message? If the message payload contains just a single change then it may not be possible to reconstruct the state of the data from the single message, and in this instance your message is a delta.
The third factor Database integration? relates to whether or not the integration-entry-point to a system is a database. The reason this matters is that integrating at the database layer is a fairly common integration scenario, and if available can greatly simplify handling re-sequencing.
The strategies from the above table are described in detail below:
Stored procedure wrapper
This is the simplest of the resequencing strategies. A new stored procedure is created which queries the target data before making a decision about whether to update the target data. The decision can be as simple as: is the data I have newer than the data in the target system?
Of course, in order to implement this strategy, the target data also has to include the sequencing field of the source data, although an approximation can be made if necessary by relying on existing time-stamps which may already exist in the target data. The stored procedure wrapper can be contained either in the target database or ideally in a separate database.
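As a hedged illustration of that decision, here is the core rule written as plain JDBC rather than an actual stored procedure (the table and column names are invented): the update is applied only when the incoming sequence is newer than what the target already holds, so stale messages fall through harmlessly.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class NewerWinsWriter {

    // Apply the update only when the incoming sequence is newer than the
    // sequence already stored for this row; otherwise silently discard.
    public boolean applyUpdate(Connection con, String id, long seq, String payload)
            throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(
                "UPDATE target SET payload = ?, seq = ? WHERE id = ? AND seq < ?")) {
            ps.setString(1, payload);
            ps.setLong(2, seq);
            ps.setString(3, id);
            ps.setLong(4, seq);
            return ps.executeUpdate() > 0; // 0 rows updated => message was stale
        }
    }
}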
Singleton orchestration wrapper
The idea behind this strategy is the singleton orchestration. This is a pattern you can implement to ensure that only a single instance of the orchestration will exist at any one time. There are many articles on the web demonstrating how to implement this pattern in BizTalk.
The core of the idea is that the singleton simply keeps track of the most recent successfully processed message sequence (or time-stamp). If the singleton receives a message which is older than the most recent sequence, it is simply discarded. This works because the messages are non-deltas, so the target system can commit only the most recent of a number of messages and the data will be in the most recent state. Only when data is committed successfully is the most recent sequence held by the singleton updated.
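A minimal sketch of that discard rule (in Java, with invented names; the real thing would live inside the orchestration and persist the sequence across dehydration):

public class LatestWinsGate {

    private long lastCommitted = Long.MIN_VALUE;

    // Process only messages newer than the last successfully committed one.
    public synchronized boolean shouldProcess(long sequence) {
        return sequence > lastCommitted;
    }

    // Update the high-water mark only after the target system commit succeeds.
    public synchronized void committed(long sequence) {
        lastCommitted = Math.max(lastCommitted, sequence);
    }
}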
Batched singleton orchestration wrapper
This strategy is based on the singleton orchestration wrapper above, except that it is more complex. Rather than keeping only the most recent sequence information in memory, the singleton is required to create and hold a working set of messages in memory, which it will re-order and then process once all expected messages from the batch have arrived. This is because the messages are deltas, so the target system MUST receive each message in the order they were intended. Once the batch has been sent successfully the singleton can terminate.
To do this it is a requisite of the source data that it contain a correlation identifier of some description which allows the batch of messages to be defined. For example, processing a defined set of orders from a customer, the inbound messages must contain an identifier for the customer. This can then be used to route the messages to the singleton orchestration instance correlated with this customer. Furthermore the message sequence field available must be of the n of a total m form.
Once the singleton is initialised it assembles a working set of messages in memory and proceeds to populate it as new messages arrive. One way I have seen this done is using a System.Collections.Generic.List as the container for the working set. Once the list has been fully populated (list length = m) then it is assumed all messages in the batch have been received and the orchestration then loops over the working set in sequence and processes the messages into the target system.
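The working-set idea can be sketched like this (Java, invented names; a sorted map stands in for the List so the final loop is already in sequence order):

import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.SortedMap;
import java.util.TreeMap;

public class BatchResequencer<T> {

    private final SortedMap<Integer, T> workingSet = new TreeMap<>();
    private final int expectedTotal; // the "m" in "n of a total m"

    public BatchResequencer(int expectedTotal) {
        this.expectedTotal = expectedTotal;
    }

    // Buffers message n; returns the whole batch, in sequence order, once all
    // m messages have arrived, and nothing before that.
    public Optional<List<T>> add(int sequence, T message) {
        workingSet.put(sequence, message);
        if (workingSet.size() == expectedTotal) {
            return Optional.of(new ArrayList<>(workingSet.values()));
        }
        return Optional.empty();
    }
}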
One of the benefits of the batched singleton orchestration wrapper is it allows concurrent processing by correlation identifier. In the example above this means that messages from two customers would be processed concurrently.
Buffer table with staggered reader wrapper
Arguably the most complex of the strategies presented, this solution is to be used when you have delta messaging with a time-stamp-based sequencing field. It can be implemented with a database of some description which acts as a re-sequencing buffer.
It is worth noting here that this re-sequencing wrapper does not guarantee ordered delivery, but used well it makes ordered delivery highly likely.
As messages arrive, they are written into the buffer and in the same operation the buffer is reordered, so that the order of messages held in the buffer are always correct.
To create the buffer reader, have a receive location which reads the messages in the buffer before passing the messages to a send port with ordered delivery enabled, which then will process the messages into the target system. You can also use a singleton orchestration as an intermediary if your target system's API semantics are too complex for a send port.
However, using this wrapper as I have described it above will not enable ordered delivery, as the messages will almost certainly be committed to the buffer in the wrong order, which will result in the messages being processed into the target system in the same (wrong) order. This is where the staggered query comes in. This is a fancy way of saying your buffer query needs to only select data at intervals of time T, AND only select those rows where the row-number is lower than buffer total row count minus C.
This has the effect of allowing sequencing to occur over an appropriate timespan. T will be familiar to most BizTalk developers as the polling interval of some adapters (such as the WCF-SQL adapter). C is slightly more difficult to set, but by increasing this number you are reducing the chances that when you poll, you will miss a message older than the most recent one in your retrieved data set.
What T and C are depends on many things, although these values should be based on your latency SLA and your message volume (or throughput). As a guideline, if you have an SLA to deliver data into your target system within 30 seconds and you process 10 messages per second, then T should be around 10 seconds and C should be around 100 rows.
Of course this only works if your messages for a given correlation id are sent by the source system during a short space of time (ideally back-to-back). The longer the interval between sends, the greater the required value of C, and the less effective the wrapper becomes.
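To make the staggered query concrete, here is a hedged JDBC sketch with an invented buffer schema (id, payload, source_ts). Parameter support for FETCH FIRST varies by database, so treat the SQL as illustrative rather than definitive:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class StaggeredReader {

    public record BufferRow(long id, String payload) {}

    // Called every T milliseconds by a poller. Reads the buffer in
    // source-timestamp order but leaves the newest C rows behind, giving
    // late-arriving older messages a chance to slot in before the next poll.
    public List<BufferRow> poll(Connection con, int c) throws SQLException {
        int total;
        try (Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM buffer")) {
            rs.next();
            total = rs.getInt(1);
        }
        int toRead = Math.max(0, total - c);
        List<BufferRow> rows = new ArrayList<>();
        try (PreparedStatement ps = con.prepareStatement(
                "SELECT id, payload FROM buffer ORDER BY source_ts FETCH FIRST ? ROWS ONLY")) {
            ps.setInt(1, toRead);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    rows.add(new BufferRow(rs.getLong("id"), rs.getString("payload")));
                }
            }
        }
        return rows; // caller forwards these in order, then deletes them from the buffer
    }
}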
One of the benefits of this strategy is you can also perform de-duplication of messages in the buffer if your data source is prone to sending duplicate messages and your target system endpoint is not idempotent. You can also use the buffer to implement FILO and other non-standard queueing semantics.
Conclusions
The strategies I have discussed here are ways of bending BizTalk to a task which it wasn't designed to do. As a result, each has caveats around cost and complexity to support, and may not work in certain scenarios. I would like to hear from anyone who has implemented other patterns for ordered delivery in BizTalk.
