Reactor Kafka: Consumption is stopped without any explicit errors - spring-kafka

We have seen issues intermittently where one of our application nodes stops consuming messages, thereby building up consumer lag in a few of the Kafka topic partitions. Eventually, all of our application nodes stop consuming messages, and consumer lags build up on all of the partitions.
At this time, we have verified that at the Kafka broker, our consumer threads were reported to be alive and healthy. When we restart one of the nodes, all the other nodes that are not consuming messages start consuming messages again. When we check the logs in these nodes, we see a message Attempt to heartbeat failed since the group is rebalancing, which indicates rebalancing and assignment of new partitions.
The theory we have so far is that due to transient network exceptions, the consumer gets into a state where the consumption is paused indefinitely, but the heartbeat is successful. Is this a possible scenario? We weren’t able to reproduce this behavior locally.
The exception that is seen before the consumption is stopped:
2022-09-23|05:41:55.414 DEBUG r.k.r.internals.ConsumerEventLoop - thread=reactive-kafka-runtime-processor-1 correlation_id= tid= provider_id= entity_type= userid= realmid= Resumed
2022-09-23|05:41:56.835 DEBUG r.k.r.internals.ConsumerEventLoop - thread=reactive-kafka-runtime-processor-1 correlation_id= tid= provider_id= entity_type= userid= realmid= Emitting 1 records, requested now 1
2022-09-23|05:41:56.835 DEBUG r.k.r.internals.ConsumerEventLoop - thread=reactive-kafka-runtime-processor-1 correlation_id= tid= provider_id= entity_type= userid= realmid= onRequest.toAdd 1, paused false
2022-09-23|05:41:56.835 DEBUG r.k.r.internals.ConsumerEventLoop - thread=reactive-kafka-runtime-processor-1 correlation_id= tid= provider_id= entity_type= userid= realmid= Paused - too many deferred commits
2022-09-23|05:41:56.835 DEBUG r.k.r.internals.ConsumerEventLoop - thread=reactive-kafka-runtime-processor-1 correlation_id= tid= provider_id= entity_type= userid= realmid= Consumer woken
2022-09-23|05:42:13.082 INFO o.a.k.clients.FetchSessionHandler - thread=reactive-kafka-runtime-processor-1 correlation_id= tid= provider_id= entity_type= userid= realmid= [Consumer clientId=consumer-runtime-processor-1, groupId=runtime-processor-prd] Error sending fetch request (sessionId=214139410, epoch=1) to node 208:
org.apache.kafka.common.errors.DisconnectException: null
2022-09-23|05:42:18.097 INFO o.a.k.clients.FetchSessionHandler - thread=reactive-kafka-runtime-processor-1 correlation_id= tid= provider_id= entity_type= userid= realmid= [Consumer clientId=consumer-runtime-processor-1, groupId=runtime-processor-prd] Error sending fetch request (sessionId=973107219, epoch=1) to node 308:
org.apache.kafka.common.errors.DisconnectException: null
2022-09-23|05:42:21.106 INFO o.a.k.clients.FetchSessionHandler - thread=reactive-kafka-runtime-processor-1 correlation_id= tid= provider_id= entity_type= userid= realmid= [Consumer clientId=consumer-runtime-processor-1, groupId=runtime-processor-prd] Error sending fetch request (sessionId=1457615640, epoch=INITIAL) to node 107:
org.apache.kafka.common.errors.DisconnectException: null
There are no logs for the next couple of hours until the restart of a different node. Restart causes the below logs to appear, and the consumer starts consuming messages again:
2022-09-23|07:39:20.019 INFO o.a.k.c.c.i.AbstractCoordinator - thread=reactive-kafka-runtime-processor-1 correlation_id= tid= provider_id= entity_type= userid= realmid= [Consumer clientId=consumer-runtime-processor-1, groupId=runtime-processor-prd] Attempt to heartbeat failed since group is rebalancing
Reactor Kafka version: 1.3.11
Auto-acknowledge (receiveAutoAck) with maxDeferredCommits was set.
Our consumer function looks like this:
public Flux<Message<String>> consumeRecords() {
    return reactiveKafkaConsumerTemplate
            .receiveAutoAck()
            .doOnNext(t -> log.info("On Next call from customer template"))
            .doOnComplete(() -> log.info("On Complete call from customer template"))
            .publishOn(Schedulers.boundedElastic())
            // rate limit flow to start processing n events every t time duration where:
            // t - is given by kafkaConsumerRateLimitConfig.getDurationInMillis()
            // n - is given by kafkaConsumerRateLimitConfig.getConcurrency()
            .flatMap(x -> Mono.just(x)
                            .delayElement(Duration.ofMillis(1000)),
                    1
            )
            .flatMap(consumerRecord ->
                    Mono.just(consumerRecord)
                            .map(kafkaRecordMapper::mapConsumerRecordToMessage)
                            // handle error
                            .doOnError(t -> {
                                log.error("AutoAck Exception occurred when consuming kafka consumer record", t);
                            })
            )
            .doOnError(throwable -> Mono.just(throwable)
                    .doOnNext(t -> log.error("AutoAck Kafka consumer template failed -", t))
                    .subscribe()
            )
            .retryWhen(Retry.indefinitely())
            .repeat();
}
KafkaReceiver Properties
Below are some of the properties we have explicitly set.
auto.commit.interval.ms = 1000
auto.offset.reset = earliest
heartbeat.interval.ms = 1000
security.protocol = SSL
with maxDeferredCommits set to 200
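For reference, here is a minimal sketch (not the actual application code) of how receiveAutoAck and the deferred-commit bound are typically wired together with reactor-kafka 1.3.x and spring-kafka's ReactiveKafkaConsumerTemplate. The broker address, topic name and deserializers are placeholders; the real application uses ErrorHandlingDeserializer per the property dump below.
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.springframework.kafka.core.reactive.ReactiveKafkaConsumerTemplate;
import reactor.kafka.receiver.ReceiverOptions;

ReactiveKafkaConsumerTemplate<String, String> buildTemplate() {
    Map<String, Object> props = new HashMap<>();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");               // placeholder
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "runtime-processor-prd");
    props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
    props.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, 1000);
    props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, 1000);
    props.put("security.protocol", "SSL");
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

    ReceiverOptions<String, String> receiverOptions =
            ReceiverOptions.<String, String>create(props)
                    // Upper bound on deferred (out-of-order) commits; when it is exceeded the
                    // consumer is paused, which produces the "Paused - too many deferred commits"
                    // log line shown above.
                    .maxDeferredCommits(200)
                    .subscription(Collections.singletonList("runtime-topic"));      // placeholder topic

    return new ReactiveKafkaConsumerTemplate<>(receiverOptions);
}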
Properties (including defaults) as printed in Splunk:
allow.auto.create.topics = true
auto.commit.interval.ms = 1000
auto.offset.reset = earliest
check.crcs = true
client.dns.lookup = use_all_dns_ips
client.id = consumer-consumer-kafkaretry-2
client.rack =
connections.max.idle.ms = 540000
default.api.timeout.ms = 60000
enable.auto.commit = false
exclude.internal.topics = true
fetch.max.bytes = 52428800
fetch.max.wait.ms = 500
fetch.min.bytes = 1
group.id = consumer-kafkaretry
group.instance.id = null
heartbeat.interval.ms = 1000
interceptor.classes = []
internal.leave.group.on.close = true
internal.throw.on.fetch.stable.offset.unsupported = false
isolation.level = read_uncommitted
key.deserializer = class org.springframework.kafka.support.serializer.ErrorHandlingDeserializer
max.partition.fetch.bytes = 1048576
max.poll.interval.ms = 300000
max.poll.records = 500
metadata.max.age.ms = 300000
metric.reporters = []
metrics.num.samples = 2
metrics.recording.level = INFO
metrics.sample.window.ms = 30000
partition.assignment.strategy = [class org.apache.kafka.clients.consumer.RangeAssignor]
receive.buffer.bytes = 65536
reconnect.backoff.max.ms = 1000
reconnect.backoff.ms = 50
request.timeout.ms = 30000
retry.backoff.ms = 100
sasl.client.callback.handler.class = null
sasl.jaas.config = null
sasl.kerberos.kinit.cmd = /usr/bin/kinit
sasl.kerberos.min.time.before.relogin = 60000
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
sasl.kerberos.ticket.renew.window.factor = 0.8
sasl.login.callback.handler.class = null
sasl.login.class = null
sasl.login.refresh.buffer.seconds = 300
sasl.login.refresh.min.period.seconds = 60
sasl.login.refresh.window.factor = 0.8
sasl.login.refresh.window.jitter = 0.05
sasl.mechanism = GSSAPI
security.protocol = SSL
security.providers = null
send.buffer.bytes = 131072
session.timeout.ms = 10000
ssl.cipher.suites = null
ssl.enabled.protocols = [TLSv1.2, TLSv1.3]
ssl.endpoint.identification.algorithm = https
ssl.engine.factory.class = null
ssl.key.password = null
ssl.keymanager.algorithm = SunX509
ssl.keystore.location = null
ssl.keystore.password = null
ssl.keystore.type = JKS
ssl.protocol = TLSv1.3
ssl.provider = null
ssl.secure.random.implementation = null
ssl.trustmanager.algorithm = PKIX
ssl.truststore.location = /app/resources/kafkatruststore.jks
ssl.truststore.password = [hidden]
ssl.truststore.type = JKS
value.deserializer = class org.springframework.kafka.support.serializer.ErrorHandlingDeserializer

Related

Count bytes with Influx's Telegraf

I can receive messages with the inputs.mqtt_consumer Telegraf plugin, but it writes a lot of data to InfluxDB.
How can I configure Telegraf to just count the number of bytes and messages received and report that to InfluxDB?
# Configuration for telegraf agent
[agent]
  interval = "20s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  hostname = ""
  omit_hostname = false

[[outputs.influxdb_v2]]
  urls = ["XXXXXXXXXXXXXXXX"]
  token = "$INFLUX_TOKEN"
  organization = "XXXXXXXXXXXXXXX"
  bucket = "XXXXXXXXXXXXXXX"

[[inputs.mqtt_consumer]]
  servers = ["tcp://XXXXXXXXXXXXXXXXXXXXX:1883"]
  topics = [
    "#",
  ]
  data_format = "value"
  data_type = "string"
I tried to Google around but didn't find any clear way to do it. I just want the number of bytes and messages received each minute for the selected topics.
I did not manage to receive all the messages and count them myself, but I found a solution where I can get the data from the broker instead. It is not exactly what I asked for, but it is fine for what I need.
topics = [
  "$SYS/broker/load/messages/received/1min",
  "$SYS/broker/load/messages/sent/1min",
]
...
data_format = "value"
data_type = "float"
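For the byte counts the question also asked about, a possible full input section (assuming a Mosquitto broker, which publishes byte counters under $SYS/broker/load/bytes/... alongside the message counters; the server address is a placeholder as above):
[[inputs.mqtt_consumer]]
  servers = ["tcp://XXXXXXXXXXXXXXXXXXXXX:1883"]
  topics = [
    "$SYS/broker/load/messages/received/1min",
    "$SYS/broker/load/messages/sent/1min",
    "$SYS/broker/load/bytes/received/1min",
    "$SYS/broker/load/bytes/sent/1min",
  ]
  data_format = "value"
  data_type = "float"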

Measuring the success rate of a command executed using Kusto Query

I am trying to find the success rate of a command (e.g., call). I have scenario markers in place that mark success, and the data is collected. Now I am using Kusto queries to create a dashboard that measures the success rate when the command is triggered.
I was trying to use percentiles to measure the success rate of the command over a period of time, as below.
Table
| where Table_Name == "call_command" and Table_Step == "CommandReceived"
| parse Table_MetaData with * "command = " command: string "," *
| where command == "call"
| summarize percentiles(command, 5, 50, 95) by Event_Time
The above query throws a "recognition error". Also, is this the right way to find the success rate of the command?
Update:
Successful command output:
call_command CommandReceived OK null null 1453 null [command = call,id = b444,retryAttempt = 0] [null] [null]
Unsuccessful command output:
call_command STOP ERROR INVALID_VALUE Failed to execute command: call, id: b444, status code: 0, error code: INVALID_VALUE, error details: . 556 [command = call,id = b444,retryAttempt = 0] [null] [null]
Table_Name - call_command
Table_Step - CommandReceived/STOP
Table_MetaData - [command = call,id = b444,retryAttempt = 0]
Table_status - OK/ERROR
percentiles() requires the first argument to be numeric, bool, timespan, or datetime; a string argument is not valid. It seems the first step is to extract whether a call was successful; once you have such a column, you can calculate the percentiles for it. Here is an example similar to your use case:
let Table = datatable(Event_Time:datetime, Table_MetaData:string) [datetime(2021-05-01),"call_command CommandReceived OK null null 1453 null [command = call,id = b444,retryAttempt = 0] [null] [null] "
,datetime(2021-05-01),"call_command CommandReceived OK null null 1453 null [command = call,id = b444,retryAttempt = 0] [null] [null] "
,datetime(2021-05-01),"call_command STOP ERROR INVALID_VALUE Failed to execute command: call, id: b444, status code: 0, error code: INVALID_VALUE, error details: . 556 [command = call,id = b444,retryAttempt = 0] [null] [null]"]
| extend CommandStatus = split(Table_MetaData, " ")[2]
| extend Success = iif(CommandStatus == "OK", true, false)
| parse Table_MetaData with * "command = " command: string "," *
| where command == "call"
| summarize percentiles(Success, 5, 50, 95) by bin(Event_Time,1d);
Table
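Not part of the original answer: if what you ultimately want is a single success rate rather than percentiles, a sketch using the Table_status column described in the update (column names are assumed to be exactly as written there):
Table
| where Table_Name == "call_command"
| parse Table_MetaData with * "command = " command: string "," *
| where command == "call"
// success rate = percentage of rows whose status is OK, per day
| summarize SuccessRate = 100.0 * countif(Table_status == "OK") / count() by bin(Event_Time, 1d)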

Neo.DatabaseError.General.UnknownError GC overhead limit exceeded in R 10.12.1

I am totally new to Neo4j. I was loading a CSV file when this issue occurred; how can I fix it? Thanks so much!
library("RNeo4j")
library("curl")
graph <- startGraph("http://localhost:7474/db/data", username = "neo4j", password = "")
clear(graph, input = F)
query <- "LOAD CSV WITH HEADERS FROM {csv} AS row CREATE (n:flights {year: row.year, month: row.mo, dep_time: row.dep_time, arr_time: row.arr_time, carrier: row.carrier, tailnum: row.tailnum, flight: row.flight, origin: row.origin, dest: row.dest, air_time: row.air_time, distance: row.distance, hour: row.hour, minute: row.minute })"
cypher(graph, query, csv = "file:///flights1/flights.csv")
Error: Client error: (400) Bad Request
Neo.DatabaseError.General.UnknownError
GC overhead limit exceeded
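Not part of the original question: "GC overhead limit exceeded" here comes from the Neo4j server JVM. Two common mitigations are raising the server heap size and batching the import so the whole CSV is not processed in a single transaction, e.g. with USING PERIODIC COMMIT (Neo4j 2.x/3.x). A sketch of the batched variant; note that periodic commit must run in its own auto-committed transaction, so it may not work through every client or endpoint:
# Sketch only: batch the LOAD CSV so each 1000 rows are committed separately.
query <- "USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM {csv} AS row
CREATE (n:flights {year: row.year, month: row.mo, dep_time: row.dep_time,
                   arr_time: row.arr_time, carrier: row.carrier})"
cypher(graph, query, csv = "file:///flights1/flights.csv")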

SparkR + Cassandra query with complex data types

We have SparkR set up to connect to Cassandra and we are able to successfully connect to and query Cassandra data. However, many of our Cassandra column families have complex data types like MapType, and we get errors when querying these types. Is there a way to coerce these before or during the query using SparkR? For example, a cqlsh query of the same data would coerce a row of the MapType column b below to a string like "{38: 262, 97: 21, 98: 470}".
Sys.setenv(SPARK_HOME = "/opt/spark")
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
mySparkPackages <- "datastax:spark-cassandra-connector:1.6.0-s_2.10"
mySparkEnvironment <- list(
  spark.local.dir = "...",
  spark.eventLog.dir = "...",
  spark.cassandra.connection.host = "...",
  spark.cassandra.auth.username = "...",
  spark.cassandra.auth.password = "...")
sc <- sparkR.init(master = "...", sparkEnvir = mySparkEnvironment, sparkPackages = mySparkPackages)
sqlContext <- sparkRSQL.init(sc)
spark.df <- read.df(sqlContext,
                    source = "org.apache.spark.sql.cassandra",
                    keyspace = "...",
                    table = "...")
spark.df.sub <- subset(spark.df, (...), select = c(1,2))
schema(spark.df.sub)
StructType
|-name = "a", type = "IntegerType", nullable = TRUE
|-name = "b", type = "MapType(IntegerType,IntegerType,true)", nullable = TRUE
r.df.sub <- collect(spark.df.sub, stringsAsFactors = FALSE)
Here we get this error from the collect():
16/07/13 12:13:50 INFO TaskSetManager: Finished task 1756.0 in stage 0.0 (TID 1756) in 1525 ms on ip-10-225-70-184.ec2.internal (1757/1758)
16/07/13 12:13:50 INFO TaskSetManager: Finished task 1755.0 in stage 0.0 (TID 1755) in 1661 ms on ip-10-225-70-184.ec2.internal (1758/1758)
16/07/13 12:13:50 INFO DAGScheduler: ResultStage 0 (dfToCols at NativeMethodAccessorImpl.java:-2) finished in 2587.670 s
16/07/13 12:13:50 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/07/13 12:13:50 INFO DAGScheduler: Job 0 finished: dfToCols at NativeMethodAccessorImpl.java:-2, took 2588.088830 s
16/07/13 12:13:51 ERROR RBackendHandler: dfToCols on org.apache.spark.sql.api.r.SQLUtils failed
Error in readBin(con, raw(), stringLen, endian = "big") :
invalid 'n' argument
Our stack:
Ubuntu 14.04.4 LTS Trusty Tahr
Cassandra v 2.1.14
Scala 2.10.6
Spark 1.6.2 with Hadoop libs 2.6
Spark-Cassandra connector 1.6.0 for Scala 2.10
DataStax Cassandra Java driver v3.0 (v3.0.1 actually)
Microsoft R Open aka Revo R version 3.2.5 with MTL
Rstudio server 0.99.902
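Not from the original post: one possible workaround, if the map keys you need are known up front, is to flatten the MapType column on the Spark side before collect(), so that only scalar columns ever reach R. A sketch against the column names above (a, b); the temp table name and the b_38/b_97 key lookups are hypothetical, and it assumes your Spark SQL dialect supports map element access with b[key]:
# Flatten the map before collecting; only scalar columns cross the R boundary.
registerTempTable(spark.df, "cass_table")
flat.df <- sql(sqlContext, "SELECT a, b[38] AS b_38, b[97] AS b_97 FROM cass_table")
r.df <- collect(flat.df, stringsAsFactors = FALSE)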

Install Postfix and Dovecot with a MySQL backend

I have spent about three days installing Postfix, Dovecot and MySQL on my VPS server. It has been a very frustrating process: I googled painfully for three days, collected the information piece by piece, and eventually made the combination work.
I just want to list the steps and all the configuration files together, hopefully useful for anyone else going through the same painful process.
Make MySQL ready: create a database named postfix (or whatever name you want), create a MySQL user postfix, and grant it all privileges on the postfix database.
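A sketch of that step, assuming the database name, user and password used by the Postfix map files later in this post (they connect as user postfix to host 127.0.0.1):
-- Sketch only: names and password are placeholders matching the config files below.
CREATE DATABASE postfix;
CREATE USER 'postfix'@'127.0.0.1' IDENTIFIED BY 'yourpassword';
GRANT ALL PRIVILEGES ON postfix.* TO 'postfix'@'127.0.0.1';
FLUSH PRIVILEGES;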
Create the following tables:
CREATE TABLE virtual_domains (
  id int(11) NOT NULL auto_increment,
  name varchar(50) NOT NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE virtual_aliases (
  id int(11) NOT NULL auto_increment,
  domain_id int(11) NOT NULL,
  source varchar(100) NOT NULL,
  destination varchar(100) NOT NULL,
  PRIMARY KEY (id),
  FOREIGN KEY (domain_id) REFERENCES virtual_domains(id) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE virtual_users (
  id int(11) NOT NULL auto_increment,
  domain_id int(11) NOT NULL,
  password varchar(32) NOT NULL,
  email varchar(100) NOT NULL,
  maildir varchar(255) NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY email (email),
  FOREIGN KEY (domain_id) REFERENCES virtual_domains(id) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Compile Postfix with MySQL support; a minimal build sketch is shown below. Afterwards, set up the Postfix configuration files that follow.
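Build sketch only; the MySQL header and library paths are distro-dependent, and AUXLIBS_MYSQL is the Postfix 3.x spelling (older releases use a single AUXLIBS variable):
make makefiles CCARGS='-DHAS_MYSQL -I/usr/include/mysql' \
    AUXLIBS_MYSQL='-L/usr/lib64/mysql -lmysqlclient -lz -lm'
make
make install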
main.cf
[root@mail postfix]# postconf -n
alias_database = hash:/etc/aliases
alias_maps = hash:/etc/aliases
command_directory = /var/postfix/usr/sbin
compatibility_level = 2
daemon_directory = /var/postfix/usr/libexec/postfix
data_directory = /var/lib/postfix
debug_peer_level = 2
debugger_command = PATH=/bin:/usr/bin:/usr/local/bin:/usr/X11R6/bin ddd
$daemon_directory/$process_name $process_id & sleep 5
home_mailbox = Maildir/
html_directory = no
inet_interfaces = all
inet_protocols = ipv4
mail_owner = postfix
mail_spool_directory = /home
mailq_path = /var/postfix/usr/bin/mailq
manpage_directory = /usr/local/man
meta_directory = /etc/postfix
mydomain = myspeedshow.com
myhostname = mail.yourdoamin.com
mynetworks_style = host
myorigin = $mydomain
newaliases_path = /var/postfix/usr/bin/newaliases
postscreen_greet_banner = "before smtp banner"
postscreen_greet_wait = 2s
postscreen_non_smtp_command_enable = no
postscreen_pipelining_enable = no
queue_directory = /var/spool/postfix
readme_directory = no
recipient_delimiter = +
sample_directory = /etc/postfix
sendmail_path = /var/postfix/usr/sbin/sendmail
setgid_group = postdrop
shlib_directory = no
smtpd_banner = $myhostname ESMTP $mail_name
smtpd_recipient_restrictions =
reject_invalid_hostname,
reject_unknown_recipient_domain,
reject_unauth_pipelining,
permit_mynetworks,
reject_unauth_destination,
reject_rbl_client zen.spamhaus.org,
reject_rbl_client bl.spamcop.net,
reject_rbl_client dnsbl.sorbs.net,
reject_rbl_client cbl.abuseat.org,
reject_rbl_client b.barracudacentral.org,
reject_rbl_client dnsbl-1.uceprotect.net,
permit
smtpd_sasl_auth_enable = yes
smtpd_sasl_path = private/auth
smtpd_sasl_type = dovecot
smtpd_tls_auth_only = yes
smtputf8_enable = no
unknown_local_recipient_reject_code = 550
virtual_alias_maps = mysql:/etc/postfix/mysql-virtual-alias-maps.cf,mysql:/etc/postfix/mysql-email2email.cf
virtual_gid_maps = static:5000
virtual_mailbox_base = /var/vmail
virtual_mailbox_domains = mysql:/etc/postfix/mysql-virtual-mailbox-domains.cf
virtual_mailbox_limit = 51200000
virtual_mailbox_maps = mysql:/etc/postfix/mysql-virtual-mailbox-maps.cf
virtual_transport = virtual
virtual_uid_maps = static:5000
master.cf
relay unix - - n - - smtp
flush unix n - n 1000? 0 flush
trace unix - - n - 0 bounce
verify unix - - n - 1 verify
rewrite unix - - - - - trivial-rewrite
proxymap unix - - n - - proxymap
anvil unix - - n - 1 anvil
scache unix - - n - 1 scache
discard unix - - n - - discard
tlsmgr unix - - n 1000? 1 tlsmgr
retry unix - - n - - error
proxywrite unix - - n - 1 proxymap
smtp unix - - n - - smtp
smtp inet n - n - 1 postscreen
smtpd pass - - n - - smtpd
lmtp unix - - n - - lmtp
cleanup unix n - n - 0 cleanup
qmgr fifo n - n 300 1 qmgr
virtual unix - n n - - virtual
dovecot unix - n n - - pipe
flags=DRhu user=vmail:vmail argv=/usr/local/libexec/dovecot/dovecot-lda -f ${sender} -d ${recipient}
mysql-virtual-mailbox-domains.cf
user=postfix
password=yourpassword
host=127.0.0.1
dbname=postfix
query=select name from virtual_domains where name='%s'
mysql-virtual-mailbox-maps.cf
user=postfix
password=yourpassword
dbname=postfix
query=select maildir from virtual_users where email='%s'
mysql-virtual-alias-maps.cf
user=postfix
password=yourpassword
host=127.0.0.1
dbname=postfix
query=select destination from virtual_aliases where source='%s'
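Not part of the original write-up: once the tables contain at least one domain, user and alias row, each map file can be sanity-checked from the shell with postmap (adjust the lookup values to your own data):
postmap -q yourdomain.com mysql:/etc/postfix/mysql-virtual-mailbox-domains.cf
postmap -q user@yourdomain.com mysql:/etc/postfix/mysql-virtual-mailbox-maps.cf
postmap -q alias@yourdomain.com mysql:/etc/postfix/mysql-virtual-alias-maps.cf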
The next step is to configure Dovecot.
10-auth.conf
disable_plaintext_auth = yes
auth_mechanisms = plain login
!include auth-sql.conf.ext
Comment out all the other !include lines.
auth-sql.conf.ext
passdb {
  driver = sql
  args = /etc/dovecot/dovecot-sql.conf.ext
}
userdb {
  driver = static
  args = uid=vmail gid=vmail home=/var/vmail/%d/%n
}
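auth-sql.conf.ext points at /etc/dovecot/dovecot-sql.conf.ext, which the original post does not show. A minimal sketch; the connect values mirror the Postfix map files above, and the password scheme must match how the 32-character password column was hashed (PLAIN-MD5 here is only an assumption):
dovecot-sql.conf.ext
driver = mysql
connect = host=127.0.0.1 dbname=postfix user=postfix password=yourpassword
default_pass_scheme = PLAIN-MD5
password_query = SELECT email AS user, password FROM virtual_users WHERE email = '%u'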
10-mail.conf
Comment out all mail_location settings.
Here we use the Maildir format and store each mailbox under /var/vmail/domain/user/Maildir/. In the virtual_users table, the maildir column should therefore be in the format 'yourdomain.com/user/Maildir/'.
If you have not populated the virtual_users.maildir column correctly, Postfix will fall back to mailbox format, which stores all mail belonging to a domain in a single file, /var/vmail/1.
