Options to reduce processing rate using Spring Kafka - spring-kafka

I'm using Spring Boot 2.2.7 and Spring Kafka. I have a KafkaListener that continuously processes stats data from a topic and writes it to MongoDB and Elasticsearch (using Spring Data).
My configuration is as follows:
@Configuration
public class StatListenerConfig {
    @Autowired
    private KafkaConfig kafkaConfig;
    @Bean
    public ConsumerFactory<String, StatsRequestDto> statsConsumerFactory() {
        return new DefaultKafkaConsumerFactory<>(kafkaConfig.statsConsumerConfigs());
    }
    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, StatsRequestDto> kafkaStatsListenerContainerFactory() {
        ConcurrentKafkaListenerContainerFactory<String, StatsRequestDto> factory = new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(statsConsumerFactory());
        factory.getContainerProperties().setAckMode(AckMode.RECORD);
        return factory;
    }
}
@Service
public class StatListener {
    private static final Logger LOGGER = LoggerFactory.getLogger(StatListener.class);
    @Autowired
    private StatsService statsService;
    @KafkaListener(topics = "${kafka.topic.stats}", containerFactory = "kafkaStatsListenerContainerFactory")
    public void receive(@Payload StatsRequestDto data) {
        Stat stats = statsService.convertToStats(data);
        statsService.save(stats).get();
    }
}
The save method is an async method.
The problem I am having is that while the queue is being processed, Elasticsearch CPU consumption is around 250%. This leads to sporadic timeout errors across the application. I am looking into how I can optimise Elasticsearch, as indexing can cause CPU spikes.
I wanted to check that if I use an async method (like above), the next message from the topic will not be processed until the previous one has completed. If that is correct, what options are there in Spring Kafka that I could use to relieve pressure on a downstream operation that might take time to complete?
Any advice would be much appreciated.

In version 2.3, we added the idleBetweenPolls container property.
With earlier versions, you could simulate that by, say, sleeping in the consumer for some time after some number of records.
You just need to be sure that the sleep plus processing time for the records returned by a poll does not exceed max.poll.interval.ms, to avoid a rebalance.
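A rough sketch of that budget check, in plain Java with no Kafka dependency. The class and method names are hypothetical; 500 and 300000 are the Kafka client defaults for max.poll.records and max.poll.interval.ms. It estimates the worst-case time spent on one poll when the listener sleeps after every N records:

```java
/** Hypothetical helper: estimates whether throttled handling of one poll fits in max.poll.interval.ms. */
final class PollBudget {

    private PollBudget() {
    }

    /**
     * Worst-case time to handle one poll when the listener sleeps for sleepMillis
     * after every sleepEvery records.
     */
    static long worstCaseMillis(int maxPollRecords, long perRecordMillis, int sleepEvery, long sleepMillis) {
        if (sleepEvery <= 0) {
            throw new IllegalArgumentException("sleepEvery must be positive");
        }
        long processing = maxPollRecords * perRecordMillis;
        long pauses = maxPollRecords / sleepEvery; // number of sleeps within one batch
        return processing + pauses * sleepMillis;
    }

    /** True if the estimate stays below the consumer's max.poll.interval.ms. */
    static boolean fits(long worstCaseMillis, long maxPollIntervalMs) {
        return worstCaseMillis < maxPollIntervalMs;
    }

    public static void main(String[] args) {
        // 500 records per poll, 200 ms each, 5 s sleep after every 50 records
        long worst = worstCaseMillis(500, 200, 50, 5_000);
        System.out.println(worst + " ms, fits: " + fits(worst, 300_000));
    }
}
```

If the estimate does not fit, lowering max.poll.records or raising max.poll.interval.ms are the usual knobs.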

Unable to set some producer settings for kafka with spring boot

I'm trying to set the retry.backoff.ms setting for kafka in my producer using the DefaultKafkaProducerFactory from org.springframework.kafka.core. Here's what I got:
public class KafkaProducerFactory<K, V> extends DefaultKafkaProducerFactory<K, V> {
    public KafkaProducerFactory(Map<String, Object> config) {
        super(config);
    }
    // remainder of the class not shown in the post
}
@Configuration
public class MyAppProducerConfig {
    @Value("${myApp.delivery-timeout-ms:#{120000}}")
    private int deliveryTimeoutMs;
    @Value("${myApp.retry-backoff-ms:#{30000}}")
    private int retryBackoffMs;
    Producer<MyKey, MyValue> myAppProducer() {
        Map<String, Object> config = new HashMap<>();
        config.put(org.apache.kafka.clients.producer.ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, deliveryTimeoutMs);
        config.put(org.apache.kafka.clients.producer.ProducerConfig.RETRY_BACKOFF_MS_CONFIG, retryBackoffMs);
        final var factory = new KafkaProducerFactory<MyKey, MyValue>(config);
        return factory.createProducer(); // calls DefaultKafkaProducerFactory
    }
}
Now when I add the following to my application.yaml
myApp:
  retry-backoff-ms: 50
  delivery-timeout-ms: 1000
This is what I see in the logging when I start the application:
o.a.k.clients.producer.ProducerConfig : ProducerConfig values:
delivery.timeout.ms = 1000
retry.backoff.ms = 1000
so the delivery.timeout.ms was set, but the retry.backoff.ms wasn't even though I did the exact same for both.
I did find how to set application properties to default kafka producer template without setting from kafka producer config bean, but I didn't see either property listed under integrated properties.
So hopefully someone can give me some pointers.
After an intense debugging session I found the issue. DefaultKafkaProducerFactory is in a shared library between teams and I'm not super familiar with the class since it's my first time touching it.
It turns out the createProducer() call in DefaultKafkaProducerFactory calls another method that is overridden in KafkaProducerFactory, which then creates an AxualProducer.
And the AxualProducerConfig always sets retry.backoff.ms to 1000ms.

How to get threadlocal for concurrency consumer?

I am developing a Spring Kafka consumer. Because of the message volume, I need to use concurrency to maintain throughput. Because the consumer is concurrent, I use a ThreadLocal object to hold per-thread data. Now I need to remove this ThreadLocal value after use.
The Spring documentation linked below suggests implementing an EventListener that listens for ConsumerStoppedEvent, but it gives no sample listener code showing how to get the ThreadLocal object and remove its value. Could you please tell me how to get the ThreadLocal instance in this case?
Code samples will be appreciated.
https://docs.spring.io/spring-kafka/docs/current/reference/html/#thread-safety
Something like this:
@SpringBootApplication
public class So71884752Application {
    public static void main(String[] args) {
        SpringApplication.run(So71884752Application.class, args);
    }
    @Bean
    public NewTopic topic2() {
        return TopicBuilder.name("topic1").partitions(2).build();
    }
    @Component
    static class MyListener implements ApplicationListener<ConsumerStoppedEvent> {
        private static final ThreadLocal<Long> threadLocalState = new ThreadLocal<>();
        @KafkaListener(topics = "topic1", groupId = "my-consumer", concurrency = "2")
        public void listen() {
            long id = Thread.currentThread().getId();
            System.out.println("set thread id to ThreadLocal: " + id);
            threadLocalState.set(id);
        }
        @Override
        public void onApplicationEvent(ConsumerStoppedEvent event) {
            System.out.println("Remove from ThreadLocal: " + threadLocalState.get());
            threadLocalState.remove();
        }
    }
}
So, I have two concurrent listener containers for the two partitions in the topic. Each of them calls my @KafkaListener method. I store the thread id in the ThreadLocal, which is enough for a simple use case and for testing the feature.
Then I implement ApplicationListener<ConsumerStoppedEvent>, whose event is emitted on the corresponding consumer thread. That lets me read the ThreadLocal value and clean it up at the end of the consumer's life.
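The reason it matters that the event arrives on the consumer thread: ThreadLocal.remove() only clears the calling thread's value. A minimal plain-Java sketch of that behaviour, with no Spring or Kafka involved (the class and method names are hypothetical):

```java
import java.util.concurrent.atomic.AtomicReference;

final class ThreadLocalCleanupDemo {

    private static final ThreadLocal<Long> STATE = new ThreadLocal<>();

    private ThreadLocalCleanupDemo() {
    }

    /** Sets a value on a worker thread, removes it there, and reports what the worker sees afterwards. */
    static Long valueSeenByWorkerAfterRemove() {
        AtomicReference<Long> seen = new AtomicReference<>(-1L);
        Thread worker = new Thread(() -> {
            STATE.set(Thread.currentThread().getId());
            STATE.remove();        // clears this thread's copy only
            seen.set(STATE.get()); // back to the initial value (null)
        });
        runAndJoin(worker);
        return seen.get();
    }

    /** Shows that remove() on another thread does not touch the caller's value. */
    static Long callerValueAfterOtherThreadRemoves() {
        STATE.set(42L);
        runAndJoin(new Thread(STATE::remove)); // removes the other thread's (empty) slot, not ours
        Long mine = STATE.get();
        STATE.remove();                        // clean up the caller's own slot
        return mine;
    }

    private static void runAndJoin(Thread thread) {
        thread.start();
        try {
            thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

So a cleanup hook that ran on some other thread would leave the consumer thread's value in place.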
The test against embedded Kafka looks like this:
@SpringBootTest
@EmbeddedKafka(bootstrapServersProperty = "spring.kafka.bootstrap-servers")
@DirtiesContext
class So71884752ApplicationTests {
    @Autowired
    KafkaTemplate<String, String> kafkaTemplate;
    @Autowired
    KafkaListenerEndpointRegistry kafkaListenerEndpointRegistry;
    @Test
    void contextLoads() throws InterruptedException {
        this.kafkaTemplate.send("topic1", "1", "foo");
        this.kafkaTemplate.send("topic1", "2", "bar");
        this.kafkaTemplate.flush();
        Thread.sleep(1000); // Give it a chance to consume data
        this.kafkaListenerEndpointRegistry.stop();
    }
}
Right. It doesn't verify anything, but it demonstrates how that event can happen.
I see something like this in the log output:
set thread id to ThreadLocal: 125
set thread id to ThreadLocal: 127
...
Remove from ThreadLocal: 125
Remove from ThreadLocal: 127
So, whatever that doc says is correct.

Empty data set returned after a few repeated calls to DAO

I have got a serious problem where the DAO-layer stops returning records after a few calls. I'm using Spring Framework 5.3.10. The main components involved are:
Spring MVC
Connection pooling over HikariCP 5.0.0
JDBC connector Jaybird 4.0.3 (Firebird 3.0.7 database server)
ThreadPoolExecutor (using default values)
Spring Transactions
Mybatis
I have got one Spring controller (A) that repeatedly (every 2 - 3 seconds) calls a method of a Spring service (B) asynchronously (the method is marked with @Async), with a different parameter for each call. There is a DAO layer (C) declared as a Spring service. The worker method in the Spring service (B) calls a DAO method at the beginning of each run to retrieve a data set from a database table corresponding to the passed parameter. At the end of the worker method in the Spring service (B), rows corresponding to the input parameter are updated (not the field corresponding to the input parameter). The method in the Spring service (B) takes a long time to process the data, about 10 - 15 seconds.
After about the third or fourth call from the Spring controller (A), the call to the DAO-method returns an empty result set. When calling the method in the Spring service (B) slowly, waiting for the previous call to complete, everything is working correctly.
Setting transaction isolation has got no effect whatsoever.
I have tried to solve this problem for a couple of days now, and getting nowhere. I would be very grateful if somebody can point me in the right direction how to solve this. Using some kind of mutexes or semaphores is just a way to circumvent the problem without really solving it.
Schematically
Controller A <----------------
     |                       |
     |                       | repeats every 2-3 secs.
Service B                    |
worker method                |
takes 15 - 20 secs. ----------
     calls DAO-method getData(token)
     |
     do work
     |
     calls DAO-method updateData(token)
Controller (A)
@Controller
@RequestMapping("/test")
public class TestController {
    @Autowired
    private TestService testService;
    ...
    ...
    @GetMapping(value="/RunWorker")
    public String runWorker(ModelMap map, HttpServletRequest hsr) {
        ...
        testService.workerMethod(token);
        ...
    }
}
Service (B)
public interface TestService {
    public void workerMethod(long token);
}
@Service
public class TestServiceImpl implements TestService {
    @Autowired
    private TestDAO testDao;
    @Override
    public void workerMethod(long token) {
        List<MyData> myDataSet = testDao.getData(token);
        ...
        // very long process
        ...
        testDao.updateData(token);
    }
}
DAO (C)
public interface TestDAO {
    public List<MyData> getData(long token);
    public void updateData(long token);
}
@Service
public class TestDAOImpl implements TestDAO {
    @Autowired
    private TestMapper testMapper; // using Mybatis mappers
    public List<MyData> getData(long token) {
        return testMapper.getData(token);
    }
    public void updateData(long token) {
        testMapper.updateData(token);
    }
}
Mapper class (D)
public interface TestMapper {
    @Select("SELECT * FROM TESTTABLE WHERE TOKEN=#{token}")
    public List<MyData> getData(@Param("token") long token);
    @Update("UPDATE TESTTABLE SET STATUS=9 WHERE TOKEN=#{token}")
    public void updateData(@Param("token") long token);
}
Thanks @M. Deinum for the suggestion about @Repository. This did not help, however.
I turned the Spring service (B) into a Spring bean with prototype scope and injected it with @Lookup. The behavior is still the same. After the second call, the DAO method getData returns an empty result set. Very puzzling and frustrating.
I solved the problem. It was probably resource exhaustion due to repeated multiple calls to the Spring service (B) with the same call parameters. I guess the statement pool got depleted, active statements not returning fast enough, and then returning empty data sets for each call.
Best regards,
Peter

How to activate RequestScope inside CompletableFuture (getting org.jboss.weld.context.ContextNotActiveException) [duplicate]

I could not find a definitive answer to whether it is safe to spawn threads within session-scoped JSF managed beans. The thread needs to call methods on the stateless EJB instance (that was dependency-injected to the managed bean).
The background is that we have a report that takes a long time to generate. This caused the HTTP request to time-out due to server settings we can't change. So the idea is to start a new thread and let it generate the report and to temporarily store it. In the meantime the JSF page shows a progress bar, polls the managed bean till the generation is complete and then makes a second request to download the stored report. This seems to work, but I would like to be sure what I'm doing is not a hack.
Check out EJB 3.1 @Asynchronous methods. This is exactly what they are for.
Small example that uses OpenEJB 4.0.0-SNAPSHOTs. Here we have a @Singleton bean with one method marked @Asynchronous. Every time that method is invoked by anyone, in this case your JSF managed bean, it will immediately return regardless of how long the method actually takes.
@Singleton
public class JobProcessor {
    @Asynchronous
    @Lock(READ)
    @AccessTimeout(-1)
    public Future<String> addJob(String jobName) {
        // Pretend this job takes a while
        doSomeHeavyLifting();
        // Return our result
        return new AsyncResult<String>(jobName);
    }
    private void doSomeHeavyLifting() {
        try {
            Thread.sleep(SECONDS.toMillis(10));
        } catch (InterruptedException e) {
            // Restore the interrupt flag before propagating
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
Here's a little testcase that invokes that @Asynchronous method several times in a row.
Each invocation returns a Future object that essentially starts out empty and will later have its value filled in by the container when the related method call actually completes.
import junit.framework.TestCase;
import javax.ejb.embeddable.EJBContainer;
import javax.naming.Context;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
public class JobProcessorTest extends TestCase {
public void test() throws Exception {
final Context context = EJBContainer.createEJBContainer().getContext();
final JobProcessor processor = (JobProcessor) context.lookup("java:global/async-methods/JobProcessor");
final long start = System.nanoTime();
// Queue up a bunch of work
final Future<String> red = processor.addJob("red");
final Future<String> orange = processor.addJob("orange");
final Future<String> yellow = processor.addJob("yellow");
final Future<String> green = processor.addJob("green");
final Future<String> blue = processor.addJob("blue");
final Future<String> violet = processor.addJob("violet");
// Wait for the result -- 1 minute worth of work
assertEquals("blue", blue.get());
assertEquals("orange", orange.get());
assertEquals("green", green.get());
assertEquals("red", red.get());
assertEquals("yellow", yellow.get());
assertEquals("violet", violet.get());
// How long did it take?
final long total = TimeUnit.NANOSECONDS.toSeconds(System.nanoTime() - start);
// Execution should be around 9 - 21 seconds
assertTrue("" + total, total > 9);
assertTrue("" + total, total < 21);
}
}
Under the covers what makes this work is:
The JobProcessor the caller sees is not actually an instance of JobProcessor. Rather it's a subclass or proxy that has all the methods overridden. Methods that are supposed to be asynchronous are handled differently.
Calls to an asynchronous method simply result in a Runnable being created that wraps the method and parameters you gave. This runnable is given to an Executor which is simply a work queue attached to a thread pool.
After adding the work to the queue, the proxied version of the method returns an implementation of Future that is linked to the Runnable which is now waiting on the queue.
When the Runnable finally executes the method on the real JobProcessor instance, it will take the return value and set it into the Future making it available to the caller.
Important to note that the AsyncResult object the JobProcessor returns is not the same Future object the caller is holding. It would have been neat if the real JobProcessor could just return String and the caller's version of JobProcessor could return Future<String>, but we didn't see any way to do that without adding more complexity. So the AsyncResult is a simple wrapper object. The container will pull the String out, throw the AsyncResult away, then put the String in the real Future that the caller is holding.
To get progress along the way, simply pass a thread-safe object like AtomicInteger to the @Asynchronous method and have the bean code periodically update it with the percent complete.
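A minimal sketch of that progress idea in plain Java, assuming a plain ExecutorService as a stand-in for the container's asynchronous invocation (the class and method names are hypothetical, not part of EJB):

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

final class ProgressDemo {

    private ProgressDemo() {
    }

    /** Stand-in for the asynchronous bean method: does `steps` units of work and updates the shared counter. */
    static String generateReport(int steps, AtomicInteger percentComplete) {
        for (int i = 1; i <= steps; i++) {
            // ... one unit of real work would go here ...
            percentComplete.set(i * 100 / steps);
        }
        return "report";
    }

    /** Runs the job on a pool thread and returns the percentage observed once it has finished. */
    static int runAndReturnFinalPercent(int steps) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        AtomicInteger percent = new AtomicInteger();
        try {
            Future<String> result = pool.submit(() -> generateReport(steps, percent));
            // A managed bean would read percent.get() on each poll request; here we simply wait.
            result.get();
            return percent.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

The caller keeps a reference to the same AtomicInteger it passed in, so reading it never blocks on the running job.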
Introduction
Spawning threads from within a session-scoped managed bean is not necessarily a hack as long as it does the job you want. But spawning threads on your own needs to be done with extreme care. The code should not be written in such a way that a single user can, for example, spawn an unlimited number of threads per session, or that the threads continue running even after the session is destroyed. That would blow up your application sooner or later.
The code needs to be written so that you can ensure that a user can, for example, never spawn more than one background thread per session and that the thread is guaranteed to be interrupted whenever the session is destroyed. For multiple tasks within a session you need to queue the tasks.
Also, all those threads should preferably be served by a common thread pool so that you can put a limit on the total number of spawned threads at application level.
Managing threads is thus a very delicate task. That's why you'd better use the built-in facilities rather than homegrowing your own with new Thread() and friends. The average Java EE application server offers a container-managed thread pool which you can use via, among others, EJB's @Asynchronous and @Schedule. To be container independent (read: Tomcat-friendly), you can also use Java 1.5's java.util.concurrent ExecutorService and ScheduledExecutorService for this.
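For the container-independent route, a minimal sketch of a shared, bounded pool might look like the following. The class and method names are hypothetical; the point is only that a fixed pool caps the number of worker threads no matter how many tasks are submitted:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class SharedPoolDemo {

    private SharedPoolDemo() {
    }

    /** Submits `tasks` jobs to a pool capped at `maxThreads` and returns how many distinct threads ran them. */
    static int distinctWorkerThreads(int maxThreads, int tasks) {
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        Set<String> workerNames = ConcurrentHashMap.newKeySet();
        for (int i = 0; i < tasks; i++) {
            // Extra tasks wait in the pool's queue instead of spawning new threads
            pool.execute(() -> workerNames.add(Thread.currentThread().getName()));
        }
        pool.shutdown(); // stop accepting work; already queued tasks still run
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("tasks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return workerNames.size();
    }
}
```

In a web application the pool would be created once at application level and shut down when the application stops, rather than per call as in this sketch.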
Below examples assume Java EE 6+ with EJB.
Fire and forget a task on form submit
@Named
@RequestScoped // Or @ViewScoped
public class Bean {
    @EJB
    private SomeService someService;
    public void submit() {
        someService.asyncTask();
        // ... (this code will immediately continue without waiting)
    }
}
@Stateless
public class SomeService {
    @Asynchronous
    public void asyncTask() {
        // ...
    }
}
Asynchronously fetch the model on page load
@Named
@RequestScoped // Or @ViewScoped
public class Bean {
    private Future<List<Entity>> asyncEntities;
    @EJB
    private EntityService entityService;
    @PostConstruct
    public void init() {
        asyncEntities = entityService.asyncList();
        // ... (this code will immediately continue without waiting)
    }
    public List<Entity> getEntities() {
        try {
            return asyncEntities.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new FacesException(e);
        } catch (ExecutionException e) {
            throw new FacesException(e);
        }
    }
}
@Stateless
public class EntityService {
    @PersistenceContext
    private EntityManager entityManager;
    @Asynchronous
    public Future<List<Entity>> asyncList() {
        List<Entity> entities = entityManager
            .createQuery("SELECT e FROM Entity e", Entity.class)
            .getResultList();
        return new AsyncResult<>(entities);
    }
}
In case you're using the JSF utility library OmniFaces, this could be done even faster if you annotate the managed bean with @Eager.
Schedule background jobs on application start
@Singleton
public class BackgroundJobManager {
    @Schedule(hour="0", minute="0", second="0", persistent=false)
    public void someDailyJob() {
        // ... (runs every start of day)
    }
    @Schedule(hour="*/1", minute="0", second="0", persistent=false)
    public void someHourlyJob() {
        // ... (runs every hour of day)
    }
    @Schedule(hour="*", minute="*/15", second="0", persistent=false)
    public void someQuarterlyJob() {
        // ... (runs every 15th minute of hour)
    }
    @Schedule(hour="*", minute="*", second="*/30", persistent=false)
    public void someHalfminutelyJob() {
        // ... (runs every 30th second of minute)
    }
}
Continuously update application wide model in background
@Named
@RequestScoped // Or @ViewScoped
public class Bean {
    @EJB
    private SomeTop100Manager someTop100Manager;
    public List<Some> getSomeTop100() {
        return someTop100Manager.list();
    }
}
@Singleton
@ConcurrencyManagement(BEAN)
public class SomeTop100Manager {
    @PersistenceContext
    private EntityManager entityManager;
    private List<Some> top100;
    @PostConstruct
    @Schedule(hour="*", minute="*/1", second="0", persistent=false)
    public void load() {
        top100 = entityManager
            .createNamedQuery("Some.top100", Some.class)
            .getResultList();
    }
    public List<Some> list() {
        return top100;
    }
}
See also:
Spawning threads in a JSF managed bean for scheduled tasks using a timer
I tried this and it works great from my JSF managed bean:
ExecutorService executor = Executors.newFixedThreadPool(1);
@EJB
private IMaterialSvc materialSvc;
private void updateMaterial(Material material, String status, Location position) {
    executor.execute(new Runnable() {
        public void run() {
            synchronized (position) {
                // TODO update material in audit? do we need materials in audit?
                int index = position.getMaterials().indexOf(material);
                Material m = materialSvc.getById(material.getId());
                m.setStatus(status);
                m = materialSvc.update(m);
                if (index != -1) {
                    position.getMaterials().set(index, m);
                }
            }
        }
    });
}
@PreDestroy
public void destroy() {
    executor.shutdown();
}

@Async annotation is creating threads, but only one thread is taking all the load

I need to persist a huge payload to a database, so I decided to use asynchronous calls to persist the records in batches. I enabled asynchronous processing with the @EnableAsync annotation and put @Async on a method in my service layer, as below:
@Async
@Transactional
public CompletableFuture<Boolean> insertRecords(List<Record> records) {
    recordRepository.saveAll(records);
    recordRepository.flush();
    LOGGER.debug(Thread.currentThread().getName() + " -> inserting");
    return CompletableFuture.completedFuture(Boolean.TRUE);
}
Above method is called from another service method
@Transactional
public void performSomeDB(InputStream is) {
    // perform another CRUD operation
    processStream(is);
}
private void processStream(InputStream is) {
    // Read stream using JsonReader and load into a list
    // record by record. Once the desired batch is met, pass the
    // list to insertRecords
    List<Record> records = new ArrayList<>();
    List<CompletableFuture<Boolean>> statuses = new ArrayList<>();
    while (stream has data) { // pseudocode condition
        records.add(record);
        statuses.add(insertRecords(records));
    }
    System.out.println(statuses.size()); // It returns >1 based on the iterations.
}
Some of the code added above is more symbolic, than actual code.
When I look at the logs, statuses.size() returns a value greater than 1, which suggests that more threads are spawned. But only one thread is used to persist, and it runs sequentially for each iteration.
http-nio-9090-exec-10 -> insert records
http-nio-9090-exec-10 -> insert records.
......................................
In the logs, I see only one thread running and persisting the batches of records in sequential order.
Why is only one thread taking the load to persist all the records?
Is my approach incorrect?
With the @Async annotation, self-invocation (calling the async method from within the same class) won't work.
You should move the method into a separate class and call it through the injected bean of that class.
@Component
public class DbInserter {
    @Async
    @Transactional
    public CompletableFuture<Boolean> insertRecords(List<Record> records) {
        recordRepository.saveAll(records);
        recordRepository.flush();
        LOGGER.debug(Thread.currentThread().getName() + " -> inserting");
        return CompletableFuture.completedFuture(Boolean.TRUE);
    }
}
That's the magic and general idea of Async. It's sharing the full load without generating several threads.
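To see why self-invocation bypasses @Async, it helps to remember that the annotation is applied through a proxy around the bean. The sketch below uses a plain JDK dynamic proxy as a stand-in; it is not Spring's actual machinery, and all names in it are hypothetical. It records which calls pass through the proxy:

```java
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

final class SelfInvocationDemo {

    private SelfInvocationDemo() {
    }

    interface Inserter {
        void insertRecords();

        void processStream();
    }

    /** Target object: processStream() calls insertRecords() on `this`, so no proxy is involved. */
    static final class InserterImpl implements Inserter {
        @Override
        public void insertRecords() {
            // the work that was meant to run asynchronously
        }

        @Override
        public void processStream() {
            insertRecords(); // plain this-call, invisible to the proxy
        }
    }

    /** Returns the names of the methods that actually went through the proxy. */
    static List<String> interceptedCalls() {
        List<String> intercepted = new ArrayList<>();
        Inserter target = new InserterImpl();
        Inserter proxy = (Inserter) Proxy.newProxyInstance(
                Inserter.class.getClassLoader(),
                new Class<?>[] {Inserter.class},
                (p, method, args) -> {
                    // An async interceptor would hand the call to an executor at this point
                    intercepted.add(method.getName());
                    return method.invoke(target, args);
                });
        proxy.processStream(); // the inner insertRecords() call is not intercepted
        proxy.insertRecords(); // called through the proxy, so it is intercepted
        return intercepted;
    }
}
```

Only two calls are recorded, processStream and then the direct insertRecords; the nested call made inside processStream never reaches the interceptor, which matches the single http-nio thread seen in the question's log.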
If you are using Spring's Java configuration, your config class can implement AsyncConfigurer to define the executor:
@Configuration
@EnableAsync
public class AppConfig implements AsyncConfigurer {
    @Override
    public Executor getAsyncExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(5);
        executor.setMaxPoolSize(10);
        executor.setQueueCapacity(50);
        executor.setThreadNamePrefix("MyPool");
        executor.initialize();
        return executor;
    }
}
You can refer to the document below for more details: http://docs.spring.io/spring/docs/3.1.x/javadoc-api/org/springframework/scheduling/annotation/EnableAsync.html
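One thing worth knowing when picking those three numbers: ThreadPoolTaskExecutor delegates to java.util.concurrent.ThreadPoolExecutor, which only adds threads beyond the core size once the queue is full. A minimal sketch with the plain JDK class (hypothetical class and method names, and an ArrayBlockingQueue standing in for the bounded queue):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class PoolSizingDemo {

    private PoolSizingDemo() {
    }

    /** Submits `tasks` blocking jobs and returns the pool size while all of them are still pending. */
    static int poolSizeWhileBusy(int core, int max, int queueCapacity, int tasks) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                core, max, 60, TimeUnit.SECONDS, new ArrayBlockingQueue<>(queueCapacity));
        CountDownLatch release = new CountDownLatch(1);
        try {
            for (int i = 0; i < tasks; i++) {
                pool.execute(() -> {
                    try {
                        release.await(); // keep the worker busy until the measurement is taken
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            return pool.getPoolSize();
        } finally {
            release.countDown(); // let all tasks finish
            pool.shutdown();
        }
    }
}
```

With core 5, max 10 and queue capacity 50, twenty pending tasks still run on five threads; the pool only grows toward ten once more than 55 tasks are outstanding.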
