Spring Kafka batch within time window - spring-kafka

Spring Boot environment listening to Kafka topics (@KafkaListener / @StreamListener)
Configured the listener factory to operate in batch mode:
ConcurrentKafkaListenerContainerFactory#setBatchListener
or via application.properties:
spring.kafka.listener.type=batch
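For reference, the Java-config route might look roughly like the following sketch (the ConsumerFactory wiring and the topic name are assumed, imports omitted as elsewhere in this post):

@Bean
public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
        ConsumerFactory<String, String> consumerFactory) {
    ConcurrentKafkaListenerContainerFactory<String, String> factory =
            new ConcurrentKafkaListenerContainerFactory<>();
    factory.setConsumerFactory(consumerFactory);
    factory.setBatchListener(true); // listener methods now receive whole batches
    return factory;
}

// a batch listener then takes a List instead of a single record
@KafkaListener(topics = "my-topic")
public void listen(List<ConsumerRecord<String, String>> records) {
    // process up to max.poll.records records at once
}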
How can I configure the framework so that, given two numbers N and T, it will try to fetch N records for the listener but won't wait more than T seconds, as described here: https://doc.akka.io/docs/akka/2.5/stream/operators/Source-or-Flow/groupedWithin.html
Some properties I've looked at (a sample configuration follows the list):
max-poll-records: ensures you won't get more than N records in a batch
fetch-min-size: get at least this amount of data in a fetch request
fetch-max-wait: but don't wait more than necessary
idleBetweenPolls: just sleep a bit between polls :)
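In application.properties terms these knobs map to roughly the following (values are only illustrative; fetch-min-size and fetch-max-wait take sizes/durations in recent Spring Boot versions, and spring.kafka.listener.idle-between-polls only exists in newer releases):

spring.kafka.listener.type=batch
spring.kafka.consumer.max-poll-records=500
spring.kafka.consumer.fetch-min-size=1MB
spring.kafka.consumer.fetch-max-wait=500ms
spring.kafka.listener.idle-between-polls=1000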
It seems like fetch-min-size combined with fetch-max-wait should do it but they compare bytes, not messages/records.
It is obviously possible to implement that by hand; I'm asking whether it's possible to configure Spring to do that for me.

"It seems like fetch-min-size combined with fetch-max-wait should do it but they compare bytes, not messages/records."
That is correct, unfortunately, Kafka provides no mechanism such as fetch.min.records.
I don't anticipate that Spring would layer this functionality on top of the kafka-clients; it would be better to ask for a new feature in Kafka itself.
Spring does not manipulate the records returned from the poll at all, except that you can now specify subBatchPerPartition to get batches containing just one partition, in order to properly support zombie fencing when using exactly-once read/process/write.
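Since the question notes that it is obviously possible to implement this by hand, here is a minimal hand-rolled sketch of the "N records or T seconds" grouping (all class and method names below are made up, not Spring Kafka API; unlike Akka's groupedWithin, the timer here is not reset after a size-triggered flush). It would be fed from an ordinary single-record listener:

import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

public class GroupedWithinBuffer<T> {

    private final int maxRecords;
    private final Consumer<List<T>> batchHandler;
    private final List<T> buffer = new ArrayList<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public GroupedWithinBuffer(int maxRecords, Duration maxWait, Consumer<List<T>> batchHandler) {
        this.maxRecords = maxRecords;
        this.batchHandler = batchHandler;
        // time-based flush: emit whatever has accumulated every maxWait
        scheduler.scheduleAtFixedRate(this::flush, maxWait.toMillis(), maxWait.toMillis(), TimeUnit.MILLISECONDS);
    }

    // call this from a single-record @KafkaListener method
    public synchronized void add(T record) {
        buffer.add(record);
        if (buffer.size() >= maxRecords) {
            flush();
        }
    }

    // size- or time-triggered flush; hands the batch to the application code
    private synchronized void flush() {
        if (!buffer.isEmpty()) {
            List<T> batch = new ArrayList<>(buffer);
            buffer.clear();
            batchHandler.accept(batch);
        }
    }
}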

Related

Make scheduler run in only one instance of multiple micro-service

I have built a micro-service with an API called deleteToken. When invoked, this API changes the status of the tuple in the DB corresponding to the token (identified by its token id) to "MARK_DELETE". Once that tuple has status "MARK_DELETE", then after 30 days a REST call should be made to a downstream service API called deleteTokenFromPartner. There is no mandate that the call to deleteTokenFromPartner has to happen right at the 30-day mark; a few hours after the 30 days is also fine.

So my plan was to write a scheduler (using Quartz or the Java Executor service) scheduled to run once every day. It will query the DB, find all rows which have status="MARK_DELETE" and whose status update is older than 30 days, and then iteratively call deleteTokenFromPartner for each of those rows. There is one DB which is highly available, and we should not have any consistency issue since we delete only after 30 days.

The problem I see is that this is a micro-service with N instances, so every instance will query the DB, get the same set of rows and make the call for the same rows. Is there any tweak I can make so that these duplicated calls are avoided? FYI, we don't make any config changes using hostnames, and if only one instance ends up being capable of running the scheduler, that too will be fine.
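For clarity, the daily job described above might look roughly like this sketch (TokenRepository, PartnerClient and their methods are placeholders, not a real API); the issue is that each of the N instances runs this same loop:

import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// placeholder contracts, for illustration only
interface TokenRepository { List<String> findIdsByStatusAndUpdatedBefore(String status, Instant cutoff); }
interface PartnerClient { void deleteTokenFromPartner(String tokenId); }

public class TokenCleanupScheduler {

    private final TokenRepository tokenRepository;
    private final PartnerClient partnerClient;
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public TokenCleanupScheduler(TokenRepository tokenRepository, PartnerClient partnerClient) {
        this.tokenRepository = tokenRepository;
        this.partnerClient = partnerClient;
    }

    public void start() {
        // runs once a day; every instance of the micro-service executes this
        scheduler.scheduleAtFixedRate(this::cleanup, 0, 1, TimeUnit.DAYS);
    }

    private void cleanup() {
        Instant cutoff = Instant.now().minus(Duration.ofDays(30));
        List<String> tokenIds = tokenRepository.findIdsByStatusAndUpdatedBefore("MARK_DELETE", cutoff);
        for (String tokenId : tokenIds) {
            partnerClient.deleteTokenFromPartner(tokenId); // downstream REST call
        }
    }
}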

How to resolve celery.backends.rpc.BacklogLimitExceeded error

I am using Celery with Flask. After working for a good long while, my Celery setup is now showing a celery.backends.rpc.BacklogLimitExceeded error.
My config values are below:
CELERY_BROKER_URL = 'amqp://'
CELERY_TRACK_STARTED = True
CELERY_RESULT_BACKEND = 'rpc'
CELERY_RESULT_PERSISTENT = False
Can anyone explain why the error is appearing and how to resolve it?
I have checked the docs here, which don't provide any resolution for the issue.
Possibly because your process consuming the results is not keeping up with the process that is producing the results? This can result in a large number of unprocessed results building up - this is the "backlog". When the size of the backlog exceeds an arbitrary limit, BacklogLimitExceeded is raised by celery.
You could try adding more consumers to process the results? Or set a shorter value for the result_expires setting?
The discussion on this closed celery issue may help:
Seems like the database backends would be a much better fit for this purpose.
The amqp/RPC result backends need to send one message per state update, while for the database-based backends (redis, sqla, django, mongodb, cache, etc.) every new state update will overwrite the old one.
The "amqp" result backend is not recommended at all since it creates one queue per task, which is required to mimic the database based backends where multiple processes can retrieve the result.
The RPC result backend is preferred for RPC-style calls where only the process that initiated the task can retrieve the result.
But if you want persistent multi-consumer result you should store them in a database.
Using rabbitmq as a broker and redis for results is a great combination, but using an SQL database for results works well too.

Flink + Kafka: Why am I losing messages?

I have written a very simple Flink streaming job which takes data from Kafka using a FlinkKafkaConsumer082.
protected DataStream<String> getKafkaStream(StreamExecutionEnvironment env, String topic) {
    Properties result = new Properties();
    result.put("bootstrap.servers", getBrokerUrl());
    result.put("zookeeper.connect", getZookeeperUrl());
    result.put("group.id", getGroup());
    return env.addSource(
            new FlinkKafkaConsumer082<>(topic, new SimpleStringSchema(), result));
}
This works very well and whenever I put something into the topic on Kafka, it is received by my Flink job and processed. Now I tried to see what happens if my Flink Job isn't online for some reason. So I shut down the flink job and kept sending messages to Kafka. Then I started my Flink job again and was expecting that it would process the messages that were sent meanwhile.
However, I got this message:
No prior offsets found for some partitions in topic collector.Customer. Fetched the following start offsets [FetchPartition {partition=0, offset=25}]
So it basically ignored all messages that came in since the last shutdown of the Flink job and just started reading at the end of the queue. From the documentation of FlinkKafkaConsumer082 I gathered that it automatically takes care of synchronizing the processed offsets with the Kafka broker. However, that doesn't seem to be the case.
I am using a single-node Kafka installation (the one that comes with the Kafka distribution) with a single-node Zookeeper installation (also the one that is bundled with the Kafka distribution).
I suspect it is some kind of misconfiguration or something along those lines, but I really don't know where to start looking. Has anyone else had this issue and maybe solved it?
I found the reason. You need to explicitly enable checkpointing in the StreamExecutionEnvironment to make the Kafka connector write the processed offsets to Zookeeper. If you don't enable it, the Kafka connector will not write the last read offset and it will therefore not be able to resume from there when the collecting Job is restarted. So be sure to write:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(); // <-- this is the important part
Anatoly's suggestion for changing the initial offset is probably still a good idea, in case checkpointing fails for some reason.
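Putting it together with the getKafkaStream helper from the question (the 5000 ms checkpoint interval is just an illustrative value):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(5000); // offsets are committed to Zookeeper as part of each checkpoint
DataStream<String> stream = getKafkaStream(env, "collector.Customer");
stream.print();
env.execute("kafka-collector");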
https://kafka.apache.org/08/configuration.html
Set auto.offset.reset to smallest (by default it's largest).
auto.offset.reset: What to do when there is no initial offset in Zookeeper or if an offset is out of range:
smallest: automatically reset the offset to the smallest offset
largest: automatically reset the offset to the largest offset
anything else: throw exception to the consumer.
If this is set to largest, the consumer may lose some messages when the number of partitions for the topics it subscribes to changes on the broker. To prevent data loss during partition addition, set auto.offset.reset to smallest.
Also make sure getGroup() returns the same group id after a restart.
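In terms of the code from the question, that would mean adding the property to the consumer Properties (and keeping the group id stable):

Properties result = new Properties();
result.put("bootstrap.servers", getBrokerUrl());
result.put("zookeeper.connect", getZookeeperUrl());
result.put("group.id", getGroup());          // must be the same value after a restart
result.put("auto.offset.reset", "smallest"); // start from the earliest available offset when none is stored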

Load balanced Fiware Orion

I just created a dockerized, load-balanced version of OCB: supervisord runs separate instances of Orion and Nginx balances between them. This is only for testing purposes.
My question is: if I use this approach, will I have trouble with ONTIMEINTERVAL subscriptions? (I don't want 'n' notifications, one per OCB process.)
Any help will be much appreciated.
The current Orion version (0.23.0) works in the following way: at creation time, the ONTIMEINTERVAL subscribeContext is dispatched by the LB to one of the CB nodes, which creates a permanent thread in charge of sending notification messages at the notification frequency.
However, there are two kinds of problems:
If the client wants to cancel the subscription by sending unsubscribeContext, that request could be received by a CB that is not managing the subscription. The operation may then result in the subscription being deleted from the DB while the notifications continue being sent.
Let's consider that at a given moment CB1 manages subscriptions S1 and S2 and CB2 manages S3 and S4, and that CB2 fails and is restarted. CB2 will "see" four subscriptions (S1, S2, S3 and S4) at start time, so 4 threads are created, and the final result is that the notifications for the subscriptions CB1 is still managing (S1 and S2) are duplicated (being sent at the same time by CB1 and CB2).
Thus, in sum, ONTIMEINTERVAL subscriptions are discouraged in HA and/or horizontal scaling scenarios. However, note that any use case based on ONTIMEINTERVAL can be "reversed" by running queryContext-based polling at the same frequency at the notification receptor, so it doesn't tend to be a big problem.
EDIT: ONTIMEINTERVAL subscriptions were removed in Orion 1.0.0. They had several problems (such as the ones described in the answer above). They also aren't really needed, as any use case based on ONTIMEINTERVAL notifications can be converted into an equivalent one in which the receptor runs queryContext at the same frequency (taking advantage of queryContext features such as pagination or filtering).
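A minimal sketch of that queryContext-based polling on the receptor side, in Java (host, port, entity id and type are made-up values; the NGSIv1 queryContext payload format is assumed):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class OrionPoller {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // poll at the frequency you would have used for the ONTIMEINTERVAL subscription
        scheduler.scheduleAtFixedRate(OrionPoller::queryContext, 0, 60, TimeUnit.SECONDS);
    }

    private static void queryContext() {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL("http://orion-host:1026/v1/queryContext").openConnection();
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "application/json");
            conn.setRequestProperty("Accept", "application/json");
            conn.setDoOutput(true);
            String body = "{\"entities\": [{\"type\": \"Room\", \"isPattern\": \"false\", \"id\": \"Room1\"}]}";
            try (OutputStream os = conn.getOutputStream()) {
                os.write(body.getBytes(StandardCharsets.UTF_8));
            }
            try (Scanner in = new Scanner(conn.getInputStream(), StandardCharsets.UTF_8.name())) {
                System.out.println(in.useDelimiter("\\A").next()); // handle the entity data here
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}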

BizTalk 2009 - How to run a process after all messages have processed from a large disassembled file

We receive many large data files daily in a variety of formats (i.e. CSV, Excel, XML, etc.). In order to process these large files we transform the incoming data into one of our standard 'collection' message classes (using XSLT and a pipeline component - either built-in or custom), disassemble the large transformed message into individual 'object' messages and then call a series of SOAP web service methods to handle business logic and database operations.
Unlike the other files we receive, this latest file contains all data rows every day, and therefore we have to handle the differences ourselves to prevent identical records from being re-processed each day.
I have a suitable mechanism for handling inserts and updates but am currently struggling with the deletes (where the record exists in the database but not in the latest file).
My current thought process is to flag the deleted records in the database using a 'cleanup' task at the end of the entire process but this would require a method to be called once all 'object' messages from the disassembled file have completed.
Is it possible to monitor individual messages from a multi-record file and call a method on completion of the whole file? Currently, all research is pointing to an orchestration with some sort of 'wait' but is this the only option?
Example: File contains 100 vehicle records. This is disassembled into 100 individual XML messages which are processed using 100 calls to a web service method. Wish to call cleanup operation when all 100 messages are complete.
The best way I've found to handle the 'all rows every day' scenario is to pre-stage the data in SQL Server where it's easier to compare the 'current' set to the 'previous' set. The INTERSECT and EXCEPT operators make it pretty easy in most cases.
Then drain the records with a Polling statement.
The component that does the de-batching would need to publish a start of batch message with the number of individual records and a correlation key.
The components that do the insert & update would need to publish a completion message with the same correlation key when it is completed processing.
The start of batch message would spin up an Orchestration that would listen for the completion messages with that correlation key and count them; once it has received the correct number it would call the cleanup, or after a timeout period it would raise an exception.
