Kafka - Dynamic / Arbitrary Partitioning - asynchronous

I'm in the process of building a consumer service for a Kafka topic. Each message contains a url to which my service will make an http request. Each message / url is completely independent from other messages / urls.
The problem I'm worried about is how to handle long-running requests. It's possible for some http requests to take 50+ minutes before a response is returned. During that time, I do not want to hold up any other messages.
What is the best way to parallelize this operation?
I know that Kafka's approach to parallelism is to create partitions. However, from what I've read, it seems that the number of partitions must be defined up front, whereas I really want an effectively infinite or dynamic number of partitions (ideally, each message would get its own partition, created on the fly).
As an example, let's say I create 1,000 partitions. If 1,001+ messages are produced to my topic, the first 1,000 requests will be made, but every message after that will be queued until the previous request in its partition finishes.
I've thought about making the http requests asynchronous but then I seem to run into a problem when determining what offset to commit.
For instance, on a single partition my consumer can read the first message and make an async request, providing a callback that commits that offset to Kafka. While that request is waiting, the consumer reads the next message and makes another async request. If that request finishes before the first, it will commit its offset. Now, what happens if the first request fails for some reason, or my consumer process dies? If I've already committed a higher offset, my first message will never get reprocessed, which is not what I want.
I'm clearly missing something when it comes to long-running, asynchronous message processing using Kafka. Has anyone experienced a similar issue or have thoughts on how to best solve this? Thanks in advance for taking the time to read this.

You should look at Apache Storm for the processing portion of your consumer and leave the message storage and retrieval to Kafka. What you've described is a very common use case in Big Data (although the 50+ minute thing is a bit extreme). In short, you'll have a small number of partitions for your topic and let Storm scale the number of processing components ("bolts" in Storm-speak) that actually make the http requests. A single spout (the kind of Storm component that reads data from an external source) reads the messages from the Kafka topic and streams them to the processing bolts.
I've posted an open source example of how to write a Storm/Kafka application on GitHub.
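To give a feel for the shape of such a topology, here is a hedged sketch (not the linked example). The spout is passed in as a parameter because a real one would come from the storm-kafka integration, and the parallelism hint of 50 is an arbitrary illustration:

    import org.apache.storm.generated.StormTopology;
    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.IRichSpout;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.tuple.Tuple;

    // The bolt that does the slow work; Storm scales it via the parallelism hint.
    class HttpRequestBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String url = tuple.getString(0);
            // ... perform the (possibly 50+ minute) http request here ...
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // nothing emitted downstream in this sketch
        }
    }

    public final class UrlFetchTopology {
        /** Wires one Kafka-reading spout to many http bolts. */
        static StormTopology build(IRichSpout kafkaSpout) {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("kafka-spout", kafkaSpout, 1);
            builder.setBolt("http-bolt", new HttpRequestBolt(), 50)
                   .shuffleGrouping("kafka-spout");
            return builder.createTopology();
        }
    }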
Some follow-on thoughts to this answer:
1) While I think Storm is the correct platform approach to take, there's no reason you couldn't roll your own by writing a Runnable that performs the http call, then writing a bit more code to make a single Kafka consumer read messages and hand them off to multi-threaded instances of your Runnable (see the first sketch after point 2). The management code required is a bit interesting, but probably easier to write than what it takes to learn Storm from scratch. You'd then scale by adding more instances of the Runnable on more threads.
2) Whether you use Storm or your own multi-threaded solution, you'll still have the problem of how to manage the offset in Kafka. The short answer is that you'll have to do your own complex offset management. Not only will you have to persist the offset of the last message you read from Kafka, you'll also have to persist and manage the list of in-flight messages currently being processed (see the second sketch below). That way, if your app goes down, you know which messages were being processed, and you can retrieve and re-process them when you start back up. The base Kafka offset persistence doesn't support this more complex need, but it's only there as a convenience for the simpler use cases anyway. You can persist your offset info anywhere you like (ZooKeeper, the file system, or any database).
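To illustrate thought (1), here is a minimal sketch of the roll-your-own approach. It assumes the newer Java consumer API (which postdates this question) and hypothetical broker, group, and topic names:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class UrlConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "url-fetchers");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("enable.auto.commit", "false"); // offsets are managed by hand

            ExecutorService pool = Executors.newFixedThreadPool(100);
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("urls"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        // Each url is fetched on its own thread, so one
                        // 50-minute request never blocks the others.
                        pool.submit(() -> fetchUrl(record.value()));
                    }
                }
            }
        }

        static void fetchUrl(String url) { /* make the http request */ }
    }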
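And to illustrate thought (2): one common shape for the bookkeeping is to commit only up to the lowest offset still in flight, so a crash replays in-flight messages rather than losing them. A per-partition sketch (hedged; a real version must also handle races between these methods and persist the result durably):

    import java.util.concurrent.ConcurrentSkipListSet;

    // Tracks which offsets on one partition are still being processed and
    // derives the highest offset that is safe to persist.
    class OffsetTracker {
        private final ConcurrentSkipListSet<Long> inFlight = new ConcurrentSkipListSet<>();
        private volatile long lastRead = -1L;

        void onRead(long offset) {
            inFlight.add(offset);
            lastRead = offset;
        }

        /** Call when a request completes; returns the offset safe to commit. */
        long onDone(long offset) {
            inFlight.remove(offset);
            // Everything below the lowest in-flight offset is fully processed.
            return inFlight.isEmpty() ? lastRead : inFlight.first() - 1;
        }
    }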

Related

How to Improve Performance of Kafka Producer when used in Synchronous Mode

I have developed an application against Kafka version 0.9.0.1 that cannot afford to lose any messages.
I have a constraint that the messages must be consumed in the correct sequence.
To ensure I do not lose any messages I have implemented retries within my application code and configured my producer with acks=all.
To enforce exception handling and to fail fast, I immediately call get() on the Future returned from Producer.send(), e.g.
final Future<RecordMetadata> futureRecordMetadata = KAFKA_PRODUCER.send(producerRecord);
futureRecordMetadata.get(); // blocks until the broker acknowledges the write
This approach works fine for guaranteeing the delivery of all messages; however, the performance is completely unacceptable.
For example, it takes 34 minutes to send 152,125 messages with acks=all.
When I comment out futureRecordMetadata.get(), I can send 1,089,125 messages in 7 minutes.
When I change acks=all to acks=1 I can send 815,038 in 30 minutes. Why is there such a big difference between acks=all and acks=1?
However, by not blocking on get() I have no way of knowing whether the message arrived safely.
I know I can pass a Callback into send() and have Kafka retry for me; however, this approach has the drawback that messages may be consumed out of sequence.
I thought the request.required.acks config could save the day for me; however, when I set any value for it I receive this warning:
130 [NamedConnector-Monitor] WARN org.apache.kafka.clients.producer.ProducerConfig - The configuration request.required.acks = -1 was supplied but isn't a known config.
Is it possible to asynchronously send Kafka messages, with a guarantee they will ALWAYS arrive safely and in the correct sequence?
UPDATE 001
Is there any way I can consume messages in Kafka message-key order directly from the topic? Or would I have to consume messages in offset order and then sort them programmatically into key order?
If you expect a total order across the whole topic, send performance will be bad, because everything has to flow through a single partition (in practice, the total-order scenario is very rare). Note also that acks=all waits for all in-sync replicas to acknowledge each write, whereas acks=1 only waits for the partition leader, which is why the throughput difference you measured is so large.
If per-partition order is acceptable, you can use a multi-threaded producer: one producer/thread for each partition, as sketched below.
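Here is a hedged sketch of one such producer thread. It pins all sends to a single partition and relies on acks=all for durability, retries for transient errors, and max.in.flight.requests.per.connection=1 so that a retry cannot reorder messages within the partition; the callback replaces the blocking get():

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class OrderedAsyncSender {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("acks", "all");  // wait for all in-sync replicas
            props.put("retries", 5);   // retry transient broker errors
            // One request at a time per connection, so a retried batch can
            // never overtake a later one and break in-partition ordering.
            props.put("max.in.flight.requests.per.connection", 1);

            int partition = 0; // this thread owns exactly one partition
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                for (int i = 0; i < 1000; i++) {
                    ProducerRecord<String, String> record =
                            new ProducerRecord<>("my-topic", partition, null, "message-" + i);
                    // Asynchronous send; the callback surfaces failures
                    // without blocking the send loop the way get() does.
                    producer.send(record, (metadata, exception) -> {
                        if (exception != null) {
                            // log / alert; this message may need resending
                        }
                    });
                }
            }
        }
    }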

WebAPI Lifecycle/Request Queue

I have an AngularJS app that calls WebAPI. If I log the time I initiate a request (in my Angular controller) and the time OnActionExecuting runs (in an action filter in my WebAPI controller), I sometimes notice a ~2 second gap. I'm assuming nothing else runs before this filter and that the gap is due to requests being blocked/queued; the reason I assume this is that if I remove all my other data calls, I do not see the gap.
What is the number of parallel requests that WebAPI can handle at once? I tried looking at the ASP.NET performance monitors but couldn't find where to see this data. Can someone shed some light on this?
There's no straight answer for this, but the shortest one is...
There is no limit in WebApi itself; the limits come from what your server can handle and how efficient the code you run on it is.
...
But since you asked, let's consider some basic things that we can assume about our server and our application...
1. Concurrent connections
A typical server runs into issues like "c10k" (https://en.wikipedia.org/wiki/C10k_problem), so that puts a hard limit on the number of concurrent connections.
Assuming each WebApi call is made from, say, some AJAX call on a web page, that gives us a limit of around 10k connections before things get evil.
2. Dependency-related overheads
If we then consider the complexity of the code in question, you may have a bottleneck in things like SQL queries. I have often written WebApi controllers whose business logic runs 10+ db queries; the overhead here may be your problem.
3. Feed-in overhead
What about network bandwidth to the server?
Let's assume we stream 1 MB of data for each call. A 1 Gb/s Ethernet line carries roughly 125 MB/s, so on the order of 125 such requests per second is enough to choke it.
4. Processing overhead
Assuming you wrote an API that does complex calculations (e.g. mesh generation for complex 3D data), you could easily choke your CPU for some time on each request.
5. Timeouts
Assuming the server accepts your request and the request is made asynchronously, the biggest remaining question is: how long are you prepared to wait for the response? The shorter that window, the less work you can afford to do before each request needs its response.
...
So as you can see, this is by no means an exhaustive list, but it outlines the complexity of the question you asked. That said, I would argue that WebApi (the framework) has no limits of its own; it's really the infrastructure around it that determines what's possible.

How do I specify a redelivery policy and separate retry queue processor in Rebus

I'm currently investigating Rebus, but without good documentation the process is proving difficult. I am hoping someone can help me understand this exciting product.
I have read that during message processing, if something goes wrong the message will return to the queue.
Is the message returned to the front of the queue or placed at the end? If placed at the front, this will be a problem, because the queue essentially becomes blocked by a message that may never be processable, at least until it times out or its retries are exceeded.
Does Rebus have support for an out-of-the-box separate Retry queue?
Can I specify the interval between retries?
Can I specify an exponential backoff interval for retries as in Apache ActiveMQ?
Thanks
1) The queue transaction is rolled back, effectively moving the message back to the front, so it will be retried immediately.
After 5 failed attempts (at least, that is the default), Rebus will move the message to the error queue. The default retry mechanism is intentionally very swift; this way, the input queue will never be clogged by poisonous messages.
If you need more sophisticated retries, I suggest you take a look at bus.Defer, which can defer delivery of a message to the future. It requires that you have a timeout manager(*) running, though.
2) I guess that's what I call "error queue", except there's no retry :)
I did create a solution some time ago, though, where I coded a simple endpoint that would periodically empty the error queue and move all the messages back into the original source queue, as a form of crude automatic second-level retry mechanism.
3) No. NServiceBus has the concept of second-level retries, but that's something I've never really needed (enough) with Rebus. So with Rebus you're on your own here; it should be fairly easy to do some intelligent bus.Defer that can then be adapted to each kind of error you're expecting.
4) See (3)
I hope that clarifies a bit :)
(*) The timeout manager can be a separate endpoint whose only job in life is to receive a message, hold on to it for a while (i.e. save it to a database), and then return it to the sender when the time has elapsed. The timeout manager can also be hosted in-process, though, by using the .Timeouts(t => t.???) configuration spell.
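As a footnote to (3) and (4): the backoff schedule itself is trivial to compute. A minimal sketch of such a policy (written in Java for illustration, although Rebus itself is .NET; the returned delay is the kind of value you would feed into bus.Defer):

    import java.time.Duration;

    // Exponential-backoff schedule for hand-rolled second-level retries.
    final class BackoffPolicy {
        private static final Duration BASE = Duration.ofSeconds(5);
        private static final int MAX_ATTEMPTS = 5;

        /** Delay before the given attempt (0-based), or null to give up. */
        static Duration delayFor(int attempt) {
            if (attempt >= MAX_ATTEMPTS) return null; // park it in the error queue
            return BASE.multipliedBy(1L << attempt);  // 5s, 10s, 20s, 40s, 80s
        }
    }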

Event Driven Architecture - Service Contract Design

I'm having difficulty conceptualising a requirement I have into something that will fit into our nascent SOA/EDA.
We have a component I'll call the Data Downloader. This is a facade for an external data provider that has both high latency and a cost associated with every request. I want to take this component and create a re-usable service out of it with a clear contract definition. It is up to me to decide how that contract should work, however its responsibilities are two-fold:
Maintain the parameter list (called a Download Definition) for an upcoming scheduled download
Manage the technical details of the communication to the external service
Basically, it manages the 'how' of the communication. The 'what' and the 'when' are the responsibilities of two other components:
The 'what' is managed by 'Clients', who are responsible for determining the parameters for the download.
The 'when' is managed by a dedicated scheduling component. Because of the cost associated with the downloads we'd like to batch the requests intraday.
Hopefully this sequence diagram explains the responsibilities of the services: [sequence diagram not reproduced here]
Because the responsibilities are split across three different components, we get all sorts of potential race conditions with async messaging. For instance, when the Scheduler tells the Downloader to do its work, because the 'Append to Download Definition' command is asynchronous, there is no guarantee that the pending requests from Client A have actually been serviced. But this all screams high coupling to me; why should the Scheduler know about any 'prerequisite' client requests that need to have been actioned before it can invoke a download?
Some potential solutions we've toyed with:
Make the 'Append to Download Definition' command a blocking request/response operation. But this breaks the performance and scalability benefits of having an EDA.
Build something into the Downloader to ensure that it only runs when there are no pending commands in its incoming request queue. But that introduces a dependency on the underlying messaging infrastructure, which I don't like either.
This makes me think I'm approaching the problem completely backward. Or is this just a classic case of trying to fit a synchronous RPC requirement into an async event-driven architecture?
The thing I like most about EDA and SOA is that they almost completely eliminate the notion of a race condition. As long as your events are associated with some association key (e.g. downloadId), the problem you describe can be addressed by several solutions of varying complexity, depending on your needs. I'm not sure I totally understand the described use case, but I will try my best.
Off the top of my head:
The DataDownloader maintains a list of received Download Definitions and a list of triggered downloads. When a definition is received, it is checked against the triggers list to see whether the associated download has already been triggered; if it has, the download is executed. When a TriggerDownloadCommand is received, the definitions list is checked for a definition with the associated downloadId. A sketch of this follows below.
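A minimal sketch of that correlation idea (all names are hypothetical, and a real implementation would need durable storage and an idempotent execute, since both handlers may race to call it):

    import java.util.Map;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    // Correlates the two asynchronous messages by downloadId so that
    // their arrival order no longer matters.
    class DataDownloader {
        private final Map<String, DownloadDefinition> definitions = new ConcurrentHashMap<>();
        private final Set<String> triggered = ConcurrentHashMap.newKeySet();

        void onAppendToDownloadDefinition(String downloadId, DownloadDefinition def) {
            definitions.put(downloadId, def);
            if (triggered.contains(downloadId)) execute(downloadId);
        }

        void onTriggerDownloadCommand(String downloadId) {
            triggered.add(downloadId);
            if (definitions.containsKey(downloadId)) execute(downloadId);
        }

        private void execute(String downloadId) {
            // Must be idempotent: both handlers above may reach this point.
        }

        static class DownloadDefinition { /* the parameter list for a download */ }
    }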
For more complex situations, consider using the Saga pattern, which is implemented by some third-party messaging infrastructures. With some simple configuration, it will handle both messages and initiate the actual download when the required condition is satisfied. This is more appropriate for distributed systems, where an in-memory collection is out of the question.
You can also configure your scheduler (or the trigger command handler) to retry when an error is signaled (e.g. by an exception), in order to avoid that race condition, and ultimately give up after a specified timeout.
Does this help?

NServiceBus, when are too many messages used?

When designing a service in NServiceBus, at what point do you start questioning whether the number of messages handled by the service is too large and begin breaking them out into a new service?
Consider the following: I have a sales service which can currently be broken into a few distinct business components, these are sales order validation, sales order processing, purchase order validation and purchase order processing.
There are currently about 20 message handlers and 2 sagas used within this service. My concern is that high-volume traffic from my website can cause an initial spike that pushes the message count into the hundreds. Considering that the messages need to be processed in the order they are taken off the queue, this can cause a delay for the last in the queue (depending on what processing each message does).
When separating concerns within a service into smaller business components, I find this makes things a little easier. Sure, it's a logical separation, but it seems to provide a layer of clarity and understanding. To me it seems an easier option than creating new services, where, in the end, the more services I have, the more maintenance I need to do.
Does anyone have any similar concerns to this?
I think you have actually answered your own question :)
As soon as the message volume reaches a point where the lag becomes an issue, you could look at adding instances of your endpoint. You do not necessarily need to reduce the number of handlers; you could simply install the service a number of times and have specific message types sent to the relevant endpoint via mappings.
It then becomes a matter of a simple instance installation and some config changes. You can then split messages on sending so that messages from a particular source end up on a particular endpoint (maybe a priority one), or split by message type.
I happened to do the same thing on a previous project (not using NServiceBus, though) where we needed document conversion messages coming from the UI to be processed ASAP. We simply installed the conversion service again with its own set of queues and changed the UI configuration to send the conversion messages to the new endpoint. The background conversion messages were still going to the previous endpoint, so there the source determined the separation.
