Flink + Kafka: Why am I losing messages?

I have written a very simple Flink streaming job which takes data from Kafka using a FlinkKafkaConsumer082.
protected DataStream<String> getKafkaStream(StreamExecutionEnvironment env, String topic) {
    Properties result = new Properties();
    result.put("bootstrap.servers", getBrokerUrl());
    result.put("zookeeper.connect", getZookeeperUrl());
    result.put("group.id", getGroup());
    return env.addSource(
        new FlinkKafkaConsumer082<>(
            topic,
            new SimpleStringSchema(), result));
}
This works very well: whenever I put something into the topic on Kafka, it is received by my Flink job and processed. Next I wanted to see what happens if my Flink job is offline for some reason, so I shut it down and kept sending messages to Kafka. Then I started the Flink job again, expecting it to process the messages that had been sent in the meantime.
However, I got this message:
No prior offsets found for some partitions in topic collector.Customer. Fetched the following start offsets [FetchPartition {partition=0, offset=25}]
So it basically ignored all messages that arrived after the Flink job was shut down and started reading at the end of the queue. From the documentation of FlinkKafkaConsumer082 I gathered that it automatically takes care of synchronizing the processed offsets with the Kafka broker. However, that doesn't seem to be the case.
I am using a single-node Kafka installation (the one that comes with the Kafka distribution) with a single-node Zookeeper installation (also the one bundled with the Kafka distribution).
I suspect it is some kind of misconfiguration, but I really don't know where to start looking. Has anyone else had this issue and maybe solved it?

I found the reason. You need to explicitly enable checkpointing in the StreamExecutionEnvironment to make the Kafka connector write the processed offsets to Zookeeper. If you don't enable it, the Kafka connector will not write the last read offset and it will therefore not be able to resume from there when the collecting Job is restarted. So be sure to write:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(); // <-- this is the important part
Anatoly's suggestion for changing the initial offset is probably still a good idea, in case checkpointing fails for some reason.

Set auto.offset.reset to smallest (the default is largest). From https://kafka.apache.org/08/configuration.html:
auto.offset.reset: What to do when there is no initial offset in Zookeeper or if an offset is out of range:
smallest : automatically reset the offset to the smallest offset
largest : automatically reset the offset to the largest offset
anything else: throw exception to the consumer.
If this is set to largest, the consumer may lose some messages when the number of partitions, for the topics it subscribes to, changes on the broker. To prevent data loss during partition addition, set auto.offset.reset to smallest.
Also make sure getGroup() returns the same value after a restart, because committed offsets are stored per consumer group.
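To make the auto.offset.reset rules quoted above concrete, here is a rough Python model of how a start offset gets picked. It is a simplified sketch, not Kafka's actual code: the function name resolve_start_offset and its parameters are made up for illustration, and the offsets are the ones from the log message in the question.

```python
def resolve_start_offset(committed, earliest, latest, auto_offset_reset="largest"):
    """Return the offset a Kafka 0.8-style consumer would start reading from.

    Hypothetical helper for illustration only; not part of any Kafka client.
    """
    # A committed offset that is still inside the log's range always wins.
    if committed is not None and earliest <= committed <= latest:
        return committed
    # No usable committed offset: fall back to the reset policy.
    if auto_offset_reset == "smallest":
        return earliest
    if auto_offset_reset == "largest":
        return latest
    # "anything else: throw exception to the consumer"
    raise ValueError("no valid committed offset and no reset policy")


# Nothing was committed and the log currently ends at offset 25.
print(resolve_start_offset(None, 0, 25, "largest"))   # 25: the backlog is skipped
print(resolve_start_offset(None, 0, 25, "smallest"))  # 0: the backlog is replayed
print(resolve_start_offset(10, 0, 25, "largest"))     # 10: resumes where it stopped
```

The last call is the case you get once offsets are actually committed: the reset policy is never consulted, which is why it only acts as a fallback.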

Related

What are the possible reasons a consumer leaves the consumer group?

I am struggling to find out why my consumer is being stopped.
The consumer stops after a certain time (around 4:52), although until then it is able to consume and process messages.
As I understand it, a consumer stops when it cannot commit the offset within max.poll.interval.ms, that is, when processing takes longer than max.poll.interval.ms.
Are there any other reasons?
Here are my basic consumer properties :
max.poll.records = 2
auto.offset.reset = latest
max.poll.interval.ms = 300000
idle.poll.interval = 60000 (between two polls)
no.of.consumers =1
consumer.group.id = test2
listener.auto.start = true
I see these statements in the log:
Received user wakeup,
Raising WakeupException in response to user wakeup,
Executing onLeavePrepare with generation Generation
Can someone help with this?
Note: we are a consumer of the Event Hub, and we only see this issue on that connection. When we connect to Kafka we do not see any issues.
@Gary, can you please help with this?
It turns out everything was fine on the Kafka configuration side. The issue was in the Docker pods: the pod's health was not being reported because the livenessProbe port differed from the application's configured port (I had hard-coded a different port by mistake). Anyway, thanks for this forum!
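For anyone who wants to rule out the max.poll.interval.ms explanation before looking elsewhere, a back-of-the-envelope check like the following may help. It is only a sketch under stated assumptions: the function name is hypothetical, the per-record processing times are invented, and counting the idle time between polls toward the gap is a worst-case assumption that a given framework may handle differently.

```python
def exceeds_poll_interval(max_poll_records, ms_per_record,
                          idle_between_polls_ms, max_poll_interval_ms):
    """True if the worst-case gap between two poll() calls exceeds the limit.

    Hypothetical helper; all arguments are in milliseconds except the record count.
    """
    processing_ms = max_poll_records * ms_per_record
    gap_ms = processing_ms + idle_between_polls_ms
    return gap_ms > max_poll_interval_ms


# Settings from the question, with two made-up per-record processing times.
print(exceeds_poll_interval(2, 100_000, 60_000, 300_000))  # False: 260 s gap
print(exceeds_poll_interval(2, 130_000, 60_000, 300_000))  # True: 320 s gap
```

If the numbers stay well under the limit, as they apparently did here, the cause is more likely outside the consumer configuration.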

Spring Kafka batch within time window

I have a Spring Boot environment listening to Kafka topics (@KafkaListener / @StreamListener).
The listener factory is configured to operate in batch mode:
ConcurrentKafkaListenerContainerFactory#setBatchListener
or via application.properties:
spring.kafka.listener.type=batch
How can I configure the framework so that, given two numbers N and T, it tries to fetch N records for the listener but does not wait more than T seconds, as described here: https://doc.akka.io/docs/akka/2.5/stream/operators/Source-or-Flow/groupedWithin.html
Some properties I've looked at:
max-poll-records: ensures you won't get more than N records in a batch
fetch-min-size: get at least this amount of data in a fetch request
fetch-max-wait: but don't wait longer than necessary
idleBetweenPolls: just sleep a bit between polls :)
It seems like fetch-min-size combined with fetch-max-wait should do it, but they compare bytes, not messages/records.
It is obviously possible to implement that by hand; I'm asking whether it's possible to configure Spring to do that for me.
Regarding "It seems like fetch-min-size combined with fetch-max-wait should do it but they compare bytes, not messages/records": that is correct. Unfortunately, Kafka provides no mechanism such as fetch.min.records.
I don't anticipate that Spring would layer this functionality on top of the kafka-clients; it would be better to ask for a new feature in Kafka itself.
Spring does not manipulate the records returned from the poll at all, except that you can now specify subBatchPerPartition to get batches containing just one partition, in order to properly support zombie fencing when using exactly-once read/process/write.
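Since neither Kafka nor Spring offers this out of the box, the "by hand" version mentioned in the question could look roughly like the sketch below. It is written in plain Python against an in-memory queue to show only the N-records-or-T-seconds logic; take_batch is a hypothetical helper, and wiring something like it into a real listener is left open.

```python
import queue
import time


def take_batch(source, max_records, max_wait_seconds):
    """Collect up to max_records items from source, waiting at most max_wait_seconds."""
    batch = []
    deadline = time.monotonic() + max_wait_seconds
    while len(batch) < max_records:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # time window is over, return what we have
        try:
            batch.append(source.get(timeout=remaining))
        except queue.Empty:
            break  # nothing more arrived within the window
    return batch


records = queue.Queue()
for i in range(5):
    records.put(i)

print(take_batch(records, 3, 0.2))  # [0, 1, 2], returned as soon as N is reached
print(take_batch(records, 3, 0.2))  # [3, 4], returned once the window expires
```

Using a single deadline rather than a per-item timeout is what keeps the total wait bounded by T no matter how slowly records trickle in.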

How to resolve celery.backends.rpc.BacklogLimitExceeded error

I am using Celery with Flask. After working for a good long while, my Celery setup is now showing a celery.backends.rpc.BacklogLimitExceeded error.
My config values are below:
CELERY_BROKER_URL = 'amqp://'
CELERY_TRACK_STARTED = True
CELERY_RESULT_BACKEND = 'rpc'
CELERY_RESULT_PERSISTENT = False
Can anyone explain why the error is appearing and how to resolve it?
I have checked the docs here, but they don't provide any resolution for the issue.
Possibly because your process consuming the results is not keeping up with the process that is producing the results? This can result in a large number of unprocessed results building up - this is the "backlog". When the size of the backlog exceeds an arbitrary limit, BacklogLimitExceeded is raised by celery.
You could try adding more consumers to process the results? Or set a shorter value for the result_expires setting?
The discussion on this closed celery issue may help:
Seems like the database backends would be a much better fit for this purpose.
The amqp/RPC result backends need to send one message per state update, while for the database-based backends (redis, sqla, django, mongodb, cache, etc.) every new state update will overwrite the old one.
The "amqp" result backend is not recommended at all since it creates one queue per task, which is required to mimic the database based backends where multiple processes can retrieve the result.
The RPC result backend is preferred for RPC-style calls where only the process that initiated the task can retrieve the result.
But if you want persistent multi-consumer result you should store them in a database.
Using rabbitmq as a broker and redis for results is a great combination, but using an SQL database for results works well too.
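Applied to the config from the question, the RabbitMQ-plus-Redis combination might look like the fragment below. Treat it as a sketch: the Redis URL assumes a local instance on the default port, the one-hour expiry is an arbitrary example value, and whether these upper-case keys are picked up depends on how your Flask app hands its config to Celery.

```python
from datetime import timedelta

# Broker stays on RabbitMQ, as in the original config.
CELERY_BROKER_URL = 'amqp://'
CELERY_TRACK_STARTED = True

# Assumed local Redis instance; with a database-style backend each state
# update overwrites the previous one instead of queueing a new message.
CELERY_RESULT_BACKEND = 'redis://localhost:6379/0'

# Old-style name for the result_expires setting; one hour is an example value.
CELERY_TASK_RESULT_EXPIRES = timedelta(hours=1)
```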

How long does Firebase throttle you?

Even with debug enabled for RemoteConfig, I still managed to get the following:
Error fetching remote config values Optional(Error Domain=com.google.remoteconfig.ErrorDomain Code=8002 "(null)"
UserInfo={error_throttled_end_time_seconds=1483110267.054194})
Here is my debug code:
let debug = FIRRemoteConfigSettings(developerModeEnabled: true)
FIRRemoteConfig.remoteConfig().configSettings = debug!
Shouldn't the above prevent throttling?
How long will the throttle error remain in effect?
I've experienced the same error due to throttling. I was calling FIRRemoteConfig.remoteConfig().fetchWithExpirationDuration with an expiry that was less than 60 seconds.
To immediately get around this issue during testing, use an alternative device. The throttling occurs against a particular device. e.g. move from your simulator to a device.
The intention is not to have a single client flooding the server with fetch requests every second. Make sensible use of the caching it offers out of the box and fetch only when necessary.
When you receive this error, plug the value of error_throttled_end_time_seconds into an epoch converter (like this one at https://www.epochconverter.com) and it will tell you the time when throttling ends. I've tested this myself, and the throttling remains in effect for 1 hour from the first moment you are throttled. So either wait an hour or try some of the other recommendations given here.
UPDATE: Also, if you continue making config requests and receive the throttle error, the expire timeout does not increase (i.e. "you are not further penalized").
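If you'd rather not paste the value into a website, the conversion is a couple of lines in any language. Here is a small Python sketch using the value from the error in the question; throttle_end and seconds_until are hypothetical helper names.

```python
from datetime import datetime, timezone


def throttle_end(error_throttled_end_time_seconds):
    """Convert the epoch value from the error's UserInfo to a UTC datetime."""
    return datetime.fromtimestamp(error_throttled_end_time_seconds, tz=timezone.utc)


def seconds_until(end, now=None):
    """Seconds left until throttling ends, never negative."""
    if now is None:
        now = datetime.now(timezone.utc)
    return max(0.0, (end - now).total_seconds())


end = throttle_end(1483110267.054194)
print(end.strftime("%Y-%m-%d %H:%M:%S %Z"))  # 2016-12-30 15:04:27 UTC
print(seconds_until(end))                    # 0.0 once that moment has passed
```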
The quick and easy hack to get your app running is to delete the application and reinstall it. Firebase identifies your device as a new device after reinstalling.
Hope it helps and saves you some time.

How to determine that BizTalk has completed processing a message

We are writing automated system tests for a BizTalk application, but have a problem determining when we can execute the test's verification. We need to be sure that BizTalk has completely processed the message, or message processing has timed out, before the verification.
[Test]
public void ReceiveValidTaskMessageTestShouldBeLoggedInMessageLog()
{
    // Exercise
    MsmqHelpers.SendMessage(InboundQueueName, ValidMessage);
    // Verify
    Assert.That(() => GetMessageCount("ReceiveError"), Is.EqualTo(0).After(1000));
    Assert.That(() => GetMessageCount("Receive"), Is.EqualTo(1).After(1000));
}
The last two lines check for the existence of a copy of the message in a table in a SQL Server database: one table for successful messages, one table for errors.
The problem here is that immediately after sending the message we verify that no message has been placed in the error table. But if BizTalk has not yet processed the message, then that assertion will pass even when it should fail.
What we need is something like this:
[Test]
public void ReceiveValidTaskMessageTestShouldBeLoggedInMessageLog()
{
    // Exercise
    MsmqHelpers.SendMessage(InboundQueueName, ValidMessage);
    // Verify
    Assert.That(() => PendingMessages, Is.EqualTo(0).After(1000));
    Assert.That(() => GetMessageCount("ReceiveError"), Is.EqualTo(0));
    Assert.That(() => GetMessageCount("Receive"), Is.EqualTo(1));
}
Herein lies the problem with automated integration testing.
Such testing is evidence-based, which is reflected in your test's assertions; you are looking for evidence that processing has taken place by checking a database.
Similarly, in order to know that processing has finished, you are seeking some evidence that this has happened. For example, theoretically you could run queries against the BizTalk message box database to check the state within.
However, BizTalk doesn't lend itself well to this kind of probing, as it has not been built with testing in mind (one of its weaknesses). I certainly wouldn't know how to go about doing this.
A couple of approaches worth considering:
Wait a "reasonable" amount of time before performing the database check to allow BizTalk to finish processing the message.
Have BizTalk output a log file (or some other evidence) just before processing completes which you can check before checking the database.
Even though the approach is limited, automated integration testing is incredibly valuable.
A better approach would be to be notified when a record appears in either of those tables and pass/fail the test as appropriate. You could use a rudimentary infinite loop to continuously poll the tables, or a more elegant solution would be to use events - see the event handler delegate for more details.
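The polling idea can be sketched with a deadline instead of a truly infinite loop, so a message that never arrives fails the test rather than hanging it. The example below is in Python for brevity and stands in for the real setup: wait_until is a hypothetical helper, and the tables dictionary plus get_message_count only simulate the two SQL tables being filled by BizTalk after a short delay.

```python
import threading
import time


def wait_until(condition, timeout_seconds, poll_interval_seconds=0.05):
    """Poll condition() until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_seconds)


# Stand-in for the two database tables; a timer simulates delayed processing.
tables = {"Receive": 0, "ReceiveError": 0}


def get_message_count(table):
    return tables[table]


threading.Timer(0.1, lambda: tables.update(Receive=1)).start()

# Processing is "done" as soon as a row shows up in either table.
processed = wait_until(
    lambda: get_message_count("Receive") + get_message_count("ReceiveError") > 0,
    timeout_seconds=2.0,
)
print(processed, tables)
```

Once wait_until returns True, the two original assertions can run without any delay, because a row in either table means processing has finished one way or the other.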
