InvalidPidMappingException in KafkaListener BeginTransaction - spring-kafka

We have been getting lots of InvalidPidMappingException failures recently.
I came across https://github.com/spring-projects/spring-kafka/issues/660, but that issue is specific to producer-only publishing in transactions.
In our case this happens in consumers, so it cannot be resolved with retries (it keeps failing on every after-rollback retry).
There is a fix in Kafka to address this exception, but it will probably not be released for another 4 months (v2.5.0). https://github.com/apache/kafka/pull/7115
Is there a way we can recover from these failures?
One option we're considering is using a ContainerAwareErrorHandler to restart the listener container when these two specific exceptions occur (see the sketch after the stack trace below). Are there any issues with this approach? What would happen if another listener is processing another message?
org.apache.kafka.common.KafkaException: Cannot execute transactional method because we are in an error state
at org.apache.kafka.clients.producer.internals.TransactionManager.maybeFailWithError(TransactionManager.java:924)
at org.apache.kafka.clients.producer.internals.TransactionManager.beginTransaction(TransactionManager.java:290)
at org.apache.kafka.clients.producer.KafkaProducer.beginTransaction(KafkaProducer.java:642)
at org.springframework.kafka.core.DefaultKafkaProducerFactory$CloseSafeProducer.beginTransaction(DefaultKafkaProducerFactory.java:597)
at org.springframework.kafka.core.ProducerFactoryUtils.getTransactionalResourceHolder(ProducerFactoryUtils.java:101)
at org.springframework.kafka.transaction.KafkaTransactionManager.doBegin(KafkaTransactionManager.java:160)
at com.projectdrgn.common.messaging.config.KafkaProducerConfig$kafkaTransactionManager$kafkaTransactionManager$1.doBegin(KafkaProducerConfig.kt:65)
at org.springframework.transaction.support.AbstractPlatformTransactionManager.getTransaction(AbstractPlatformTransactionManager.java:376)
at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:137)
at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.recordAfterRollback(KafkaMessageListenerContainer.java:1463)
at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.invokeRecordListenerInTx(KafkaMessageListenerContainer.java:1438)
at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.invokeRecordListener(KafkaMessageListenerContainer.java:1398)
at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.invokeListener(KafkaMessageListenerContainer.java:1165)
at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.pollAndInvoke(KafkaMessageListenerContainer.java:949)
at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.run(KafkaMessageListenerContainer.java:884)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.kafka.common.KafkaException: org.apache.kafka.common.errors.InvalidPidMappingException: The producer attempted to use a producer id which is not currently assigned to its transactional id.
at org.apache.kafka.clients.producer.internals.TransactionManager$AddPartitionsToTxnHandler.handleResponse(TransactionManager.java:1204)
at org.apache.kafka.clients.producer.internals.TransactionManager$TxnRequestHandler.onComplete(TransactionManager.java:1069)
at org.apache.kafka.clients.ClientResponse.onComplete(ClientResponse.java:109)
at org.apache.kafka.clients.NetworkClient.completeResponses(NetworkClient.java:561)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:553)
at org.apache.kafka.clients.producer.internals.Sender.runOnce(Sender.java:307)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:238)
... 1 common frames omitted
Caused by: org.apache.kafka.common.errors.InvalidPidMappingException: The producer attempted to use a producer id which is not currently assigned to its transactional id.
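
For concreteness, this is roughly the restart approach we have in mind. It is a minimal sketch, not a proven recipe: the class name, the hasCause helper, and the restart-on-a-separate-thread detail are my assumptions; only the ContainerAwareErrorHandler contract itself comes from spring-kafka.

import java.util.List;

import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.errors.InvalidPidMappingException;
import org.springframework.core.task.SimpleAsyncTaskExecutor;
import org.springframework.kafka.KafkaException;
import org.springframework.kafka.listener.ContainerAwareErrorHandler;
import org.springframework.kafka.listener.MessageListenerContainer;

public class RestartContainerErrorHandler implements ContainerAwareErrorHandler {

    private final SimpleAsyncTaskExecutor executor = new SimpleAsyncTaskExecutor();

    @Override
    public void handle(Exception thrownException, List<ConsumerRecord<?, ?>> records,
            Consumer<?, ?> consumer, MessageListenerContainer container) {
        if (!hasCause(thrownException, InvalidPidMappingException.class)) {
            // Not the case we want to handle; rethrow so normal error handling applies.
            throw new KafkaException("Listener failed", thrownException);
        }
        // Restart on a separate thread: stop() waits for the consumer threads
        // to exit their poll loops, so it must not run on the listener thread.
        executor.execute(() -> {
            container.stop();
            container.start();
        });
    }

    private boolean hasCause(Throwable throwable, Class<? extends Throwable> type) {
        while (throwable != null) {
            if (type.isInstance(throwable)) {
                return true;
            }
            throwable = throwable.getCause();
        }
        return false;
    }
}

Regarding the other-listener question: with concurrency > 1, stop() should wait for the sibling consumer threads to finish their in-flight records, and records whose transactions were rolled back would be redelivered after the restart, since their offsets were never committed.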

Related

Spring kafka MessageListener and max.poll.records

I am using spring-kafka 2.7.8 to consume messages from Kafka.
The consumer listener is as below:
@KafkaListener(topics = "topicName",
        groupId = "groupId",
        containerFactory = "kafkaListenerFactory")
public void onMessage(ConsumerRecord record) {
}
The onMessage method above receives a single message at a time.
Does this mean max.poll.records is set to 1 by the Spring library, or does it poll 500 records at a time (the default value) and hand them to the method one by one?
The reason for asking is that we often see the errors below together in prod; all four of them appeared for multiple consumers in under a minute.
I am trying to understand whether this is due to an intermittent Kafka broker connectivity issue or due to load. Please advise.
Seek to current after exception; nested exception is org.apache.kafka.clients.consumer.CommitFailedException: Offset commit cannot be completed since the consumer is not part of an active group for auto partition assignment; it is likely that the consumer was kicked out of the group.
seek to current after exception; nested exception is org.apache.kafka.common.errors.TimeoutException: Timeout of 60000ms expired before successfully committing offsets {topic-9=OffsetAndMetadata{offset=2729058, leaderEpoch=null, metadata=''}}
Consumer clientId=consumer-groupName-5, groupId=consumer] Offset commit failed on partition topic-33 at offset 2729191: The coordinator is not aware of this member.
Seek to current after exception; nested exception is org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records
max.poll.records is not changed by Spring; it will take the default (or whatever you set it to). The records are handed to the listener one at a time before the next poll.
This means that your listener must be able to process max.poll.records within max.poll.interval.ms.
You need to reduce max.poll.records and/or increase max.poll.interval.ms so that you can process the records in that time, with a generous margin, to avoid these rebalances.
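
For example, both properties can be tuned on the consumer factory behind the kafkaListenerFactory named in your @KafkaListener. This is an illustrative sketch; the property values and the bootstrap address are placeholders, not recommendations:

import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.core.DefaultKafkaConsumerFactory;

@Configuration
public class ConsumerTuningConfig {

    @Bean
    public ConsumerFactory<String, String> consumerFactory() {
        Map<String, Object> props = new HashMap<>();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        // The worst case (max.poll.records * time per record) must fit
        // comfortably inside max.poll.interval.ms.
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 100);         // default is 500
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 600_000); // default is 300000 (5 min)
        return new DefaultKafkaConsumerFactory<>(props);
    }

    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerFactory(
            ConsumerFactory<String, String> consumerFactory) {
        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory);
        return factory;
    }
}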

Grails 3.3.10: running integration tests Error creating bean with name 'com.cabolabs.security.UserController'

I'm trying to create an integration test for a service and keep getting an error saying that a controller, which is totally disconnected from the service under test, can't be created. After spending a day or so on everything that might cause the issue, and failing miserably, I decided to create a project from scratch and start adding code line by line.
So I did:
grails create-app test
grails create-domain-class com.cabolabs.security.User
added username and password String fields
grails generate-all com.cabolabs.security.User
grails create-service com.cabolabs.cloud.BalanceUpdate
grails create-integration-test com.cabolabs.cloud.BalanceUpdate
grails test-app com.cabolabs.cloud.BalanceUpdate -integration
That runs OK; the test fails because of the default generated code, but that is not important.
Then I started to add references to services in the UserController and the BalanceUpdateService, like mailService from the mail plugin.
The test behaved as before.
Then I added this line, which I have extensively used in many controllers of my original project:
def config = grailsApplication.config
With that line, the whole thing fell apart and I got this error:
org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'com.cabolabs.security.UserController': Instantiation of bean failed; nested exception is org.springframework.beans.BeanInstantiationException: Failed to instantiate [com.cabolabs.security.UserController]: Constructor threw exception; nested exception is java.lang.IllegalStateException: No thread-bound request found: Are you referring to request attributes outside of an actual web request, or processing a request outside of the originally receiving thread? If you are actually operating within a web request and still receive this message, your code is probably running outside of DispatcherServlet/DispatcherPortlet: In this case, use RequestContextListener or RequestContextFilter to expose the current request.
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.instantiateBean(AbstractAutowireCapableBeanFactory.java:1160)
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBeanInstance(AbstractAutowireCapableBeanFactory.java:1104)
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.doCreateBean(AbstractAutowireCapableBeanFactory.java:511)
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBean(AbstractAutowireCapableBeanFactory.java:481)
at org.springframework.beans.factory.support.AbstractBeanFactory$1.getObject(AbstractBeanFactory.java:312)
at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.getSingleton(DefaultSingletonBeanRegistry.java:230)
at org.springframework.beans.factory.support.AbstractBeanFactory.doGetBean(AbstractBeanFactory.java:308)
at org.springframework.beans.factory.support.AbstractBeanFactory.getBean(AbstractBeanFactory.java:197)
at org.springframework.context.support.AbstractApplicationContext.getBean(AbstractApplicationContext.java:1080)
at org.spockframework.spring.SpringMockTestExecutionListener.beforeTestMethod(SpringMockTestExecutionListener.java:54)
at org.spockframework.spring.AbstractSpringTestExecutionListener.beforeTestMethod(AbstractSpringTestExecutionListener.java:23)
at org.springframework.test.context.TestContextManager.beforeTestMethod(TestContextManager.java:269)
at org.spockframework.spring.SpringTestContextManager.beforeTestMethod(SpringTestContextManager.java:54)
at org.spockframework.spring.SpringInterceptor.interceptSetupMethod(SpringInterceptor.java:45)
at org.spockframework.runtime.extension.AbstractMethodInterceptor.intercept(AbstractMethodInterceptor.java:28)
at org.spockframework.runtime.extension.MethodInvocation.proceed(MethodInvocation.java:87)
at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecuter.runTestClass(JUnitTestClassExecuter.java:114)
at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecuter.execute(JUnitTestClassExecuter.java:57)
at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassProcessor.processTestClass(JUnitTestClassProcessor.java:66)
at org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.processTestClass(SuiteTestClassProcessor.java:51)
at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:35)
at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
at org.gradle.internal.dispatch.ContextClassLoaderDispatch.dispatch(ContextClassLoaderDispatch.java:32)
at org.gradle.internal.dispatch.ProxyDispatchAdapter$DispatchingInvocationHandler.invoke(ProxyDispatchAdapter.java:93)
at org.gradle.api.internal.tasks.testing.worker.TestWorker.processTestClass(TestWorker.java:109)
at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:35)
at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
at org.gradle.internal.remote.internal.hub.MessageHubBackedObjectConnection$DispatchWrapper.dispatch(MessageHubBackedObjectConnection.java:147)
at org.gradle.internal.remote.internal.hub.MessageHubBackedObjectConnection$DispatchWrapper.dispatch(MessageHubBackedObjectConnection.java:129)
at org.gradle.internal.remote.internal.hub.MessageHub$Handler.run(MessageHub.java:404)
at org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:63)
at org.gradle.internal.concurrent.StoppableExecutorImpl$1.run(StoppableExecutorImpl.java:46)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.springframework.beans.BeanInstantiationException: Failed to instantiate [com.cabolabs.security.UserController]: Constructor threw exception; nested exception is java.lang.IllegalStateException: No thread-bound request found: Are you referring to request attributes outside of an actual web request, or processing a request outside of the originally receiving thread? If you are actually operating within a web request and still receive this message, your code is probably running outside of DispatcherServlet/DispatcherPortlet: In this case, use RequestContextListener or RequestContextFilter to expose the current request.
at org.springframework.beans.BeanUtils.instantiateClass(BeanUtils.java:154)
at org.springframework.beans.factory.support.SimpleInstantiationStrategy.instantiate(SimpleInstantiationStrategy.java:89)
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.instantiateBean(AbstractAutowireCapableBeanFactory.java:1152)
... 34 more
Caused by: java.lang.IllegalStateException: No thread-bound request found: Are you referring to request attributes outside of an actual web request, or processing a request outside of the originally receiving thread? If you are actually operating within a web request and still receive this message, your code is probably running outside of DispatcherServlet/DispatcherPortlet: In this case, use RequestContextListener or RequestContextFilter to expose the current request.
at org.springframework.web.context.request.RequestContextHolder.currentRequestAttributes(RequestContextHolder.java:131)
at grails.web.api.WebAttributes$Trait$Helper.currentRequestAttributes(WebAttributes.groovy:45)
at grails.web.api.WebAttributes$Trait$Helper.getGrailsAttributes(WebAttributes.groovy:54)
at grails.web.api.WebAttributes$Trait$Helper.getGrailsApplication(WebAttributes.groovy:134)
at org.springframework.beans.BeanUtils.instantiateClass(BeanUtils.java:142)
... 36 more
My question is: how can I have the config injected into my controllers and still make the integration tests work? Thanks.
Update: I created a project from scratch and added code line by line, and it seems that adding this to the controller makes the test for the service fail (even though the two are not connected in any way): def config = grailsApplication.config
Since I used that in many controllers, I tried changing it to Holders.config, and that actually worked. I'll report this as a bug to Grails Core.

Firebase StrictMode resource leak

After converting from Crashlytics via Fabric to Crashlytics via Firebase, I started seeing the call stack below in debug runs where StrictMode is enabled to look for resource leaks.
StrictMode is set up with this code, only on debug builds:
StrictMode.setVmPolicy(new StrictMode.VmPolicy.Builder()
.detectLeakedClosableObjects()
.penaltyLog()
.build());
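
(The "only on debug builds" part is done with the usual BuildConfig guard in Application.onCreate(); the guard and placement are my assumptions, since the snippet above omits them:)

@Override
public void onCreate() {
    super.onCreate();
    if (BuildConfig.DEBUG) {
        // Same policy as above, but applied only to debug builds.
        StrictMode.setVmPolicy(new StrictMode.VmPolicy.Builder()
                .detectLeakedClosableObjects()
                .penaltyLog()
                .build());
    }
}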
I'm using this version of Fabric's Gradle tools in the project-level Gradle file:
classpath "io.fabric.tools:gradle:1.27.0"
and these versions of Firebase and Crashlytics in the module-level Gradle file:
implementation "com.google.firebase:firebase-core:16.0.7"
implementation "com.crashlytics.sdk.android:crashlytics:2.9.8"
During initialization, the Firebase instrumentation kicks off a background thread that makes settings calls using okhttp. When it does, StrictMode produces this call stack:
W/CrashlyticsCore: Received null settings, skipping report submission!
D/StrictMode: StrictMode policy violation: android.os.strictmode.LeakedClosableViolation: A resource was acquired at attached stack trace but never released. See java.io.Closeable for information on avoiding resource leaks.
at android.os.StrictMode$AndroidCloseGuardReporter.report(StrictMode.java:1786)
at dalvik.system.CloseGuard.warnIfOpen(CloseGuard.java:264)
at java.util.zip.Inflater.finalize(Inflater.java:398)
at java.lang.Daemons$FinalizerDaemon.doFinalize(Daemons.java:250)
at java.lang.Daemons$FinalizerDaemon.runInternal(Daemons.java:237)
at java.lang.Daemons$Daemon.run(Daemons.java:103)
at java.lang.Thread.run(Thread.java:764)
Caused by: java.lang.Throwable: Explicit termination method 'end' not called
at dalvik.system.CloseGuard.open(CloseGuard.java:221)
at java.util.zip.Inflater.<init>(Inflater.java:114)
at com.android.okhttp.okio.GzipSource.<init>(GzipSource.java:62)
at com.android.okhttp.internal.http.HttpEngine.unzip(HttpEngine.java:473)
at com.android.okhttp.internal.http.HttpEngine.readResponse(HttpEngine.java:648)
at com.android.okhttp.internal.huc.HttpURLConnectionImpl.execute(HttpURLConnectionImpl.java:471)
at com.android.okhttp.internal.huc.HttpURLConnectionImpl.getResponse(HttpURLConnectionImpl.java:407)
at com.android.okhttp.internal.huc.HttpURLConnectionImpl.getResponseCode(HttpURLConnectionImpl.java:538)
at com.android.okhttp.internal.huc.DelegatingHttpsURLConnection.getResponseCode(DelegatingHttpsURLConnection.java:105)
at com.android.okhttp.internal.huc.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:26)
at io.fabric.sdk.android.services.network.HttpRequest.code(HttpRequest.java:1357)
at io.fabric.sdk.android.services.settings.DefaultSettingsSpiCall.handleResponse(DefaultSettingsSpiCall.java:104)
at io.fabric.sdk.android.services.settings.DefaultSettingsSpiCall.invoke(DefaultSettingsSpiCall.java:88)
at io.fabric.sdk.android.services.settings.DefaultSettingsController.loadSettingsData(DefaultSettingsController.java:90)
at io.fabric.sdk.android.services.settings.DefaultSettingsController.loadSettingsData(DefaultSettingsController.java:67)
at io.fabric.sdk.android.services.settings.Settings.loadSettingsData(Settings.java:153)
at io.fabric.sdk.android.Onboarding.retrieveSettingsData(Onboarding.java:126)
at io.fabric.sdk.android.Onboarding.doInBackground(Onboarding.java:99)
at io.fabric.sdk.android.Onboarding.doInBackground(Onboarding.java:45)
at io.fabric.sdk.android.InitializationTask.doInBackground(InitializationTask.java:63)
at io.fabric.sdk.android.InitializationTask.doInBackground(InitializationTask.java:28)
at io.fabric.sdk.android.services.concurrency.AsyncTask$2.call(AsyncTask.java:311)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:458)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
at java.lang.Thread.run(Thread.java:764) 
D/FA: Event not sent since app measurement is disabled
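
My reading of the violation: the Inflater is created by okhttp's GzipSource when a response arrives gzip-encoded. The typical way such a leak arises (illustrative only; this is my assumption about the pattern, not necessarily Fabric's actual code) is opening a response and never consuming or closing its body stream:

import java.net.HttpURLConnection;
import java.net.URL;

public class LeakIllustration {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint standing in for Fabric's settings call.
        URL url = new URL("https://example.com/settings");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        int code = conn.getResponseCode(); // opens the (gzip-wrapped) response stream
        System.out.println(code);
        // Missing cleanup, e.g. conn.getInputStream().close();
        // Without it, GzipSource's Inflater is reclaimed only by the finalizer,
        // which is exactly what CloseGuard/StrictMode flags in the stack above.
    }
}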
I see this happening on pretty much every debug run of my app, during initial activity startup. However, support claims they don't see it.
Does anyone know what conditions cause Fabric to kick off this onboarding/settings thread?
I've seen similar StrictMode call stacks in other posts, and I can't tell if this leak is in Fabric's code or in the okhttp library it uses. Here are links to cases where people are seeing what looks to me like the same underlying resource leak:
StrictMode penalising for Firebase Ads
Crashlytics with StrictMode enabled (detect all) gives "untagged socket detected"
https://github.com/cloudant/sync-android/issues/577

Google Dataflow writing insufficient data to datastore

One of my batch jobs failed tonight with a runtime exception. It writes data to Datastore like the 200 other jobs that were running tonight. This one failed with a very long list of causes, the root of which should be this:
Caused by: com.google.datastore.v1.client.DatastoreException: I/O error, code=UNAVAILABLE
at com.google.datastore.v1.client.RemoteRpc.makeException(RemoteRpc.java:126)
at com.google.datastore.v1.client.RemoteRpc.call(RemoteRpc.java:95)
at com.google.datastore.v1.client.Datastore.commit(Datastore.java:84)
at com.google.cloud.dataflow.sdk.io.datastore.DatastoreV1$DatastoreWriterFn.flushBatch(DatastoreV1.java:925)
at com.google.cloud.dataflow.sdk.io.datastore.DatastoreV1$DatastoreWriterFn.processElement(DatastoreV1.java:892)
Caused by: java.io.IOException: insufficient data written
at sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.close(HttpURLConnection.java:3501)
at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:81)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:981)
at com.google.datastore.v1.client.RemoteRpc.call(RemoteRpc.java:87)
at com.google.datastore.v1.client.Datastore.commit(Datastore.java:84)
at com.google.cloud.dataflow.sdk.io.datastore.DatastoreV1$DatastoreWriterFn.flushBatch(DatastoreV1.java:925)
at com.google.cloud.dataflow.sdk.io.datastore.DatastoreV1$DatastoreWriterFn.processElement(DatastoreV1.java:892)
at com.google.cloud.dataflow.sdk.util.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:49)
at com.google.cloud.dataflow.sdk.util.DoFnRunnerBase.processElement(DoFnRunnerBase.java:139)
at com.google.cloud.dataflow.sdk.runners.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:188)
at com.google.cloud.dataflow.sdk.runners.worker.ForwardingParDoFn.processElement(ForwardingParDoFn.java:42)
at com.google.cloud.dataflow.sdk.runners.
How can this happen? The job is very similar to all the other jobs I run. I am using Dataflow SDK version 1.9.0 and the standard DatastoreIO.v1().write....
The jobIds with this error message:
2017-08-29_17_05_19-6961364220840664744
2017-08-29_16_40_46-15665765683196208095
Is it possible to retrieve the errors/logs of a job from an outside application (not the Cloud Console), so that jobs can be restarted automatically when they fail for temporary reasons such as quota issues but would normally succeed?
Thanks in advance.
This is most likely because DatastoreIO is trying to write more mutations in one RPC call than the Datastore RPC size limit allows. This is data-dependent: presumably the data for this job differs somewhat from the data for the other jobs. In any case, this issue was fixed in 2.1.0; updating your SDK should help.
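
Roughly, the fix caps each commit by serialized byte size rather than a fixed entity count. A sketch of the idea (MAX_RPC_BYTES and commit() are hypothetical stand-ins, not Beam's actual code):

import java.util.ArrayList;
import java.util.List;

import com.google.datastore.v1.Mutation;

public class SizeAwareBatching {

    // Hypothetical cap, kept safely under Datastore's RPC size limit.
    private static final long MAX_RPC_BYTES = 9L * 1024 * 1024;

    static void writeAll(List<Mutation> mutations) {
        List<Mutation> batch = new ArrayList<>();
        long batchBytes = 0;
        for (Mutation mutation : mutations) {
            int size = mutation.getSerializedSize(); // protobuf message size
            if (!batch.isEmpty() && batchBytes + size > MAX_RPC_BYTES) {
                commit(batch); // flush before the batch would exceed the limit
                batch = new ArrayList<>();
                batchBytes = 0;
            }
            batch.add(mutation);
            batchBytes += size;
        }
        if (!batch.isEmpty()) {
            commit(batch);
        }
    }

    // Hypothetical stand-in for the actual Datastore commit RPC.
    static void commit(List<Mutation> batch) {
        // datastore.commit(...) would go here
    }
}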

"KinesisClientLibIOException: Shard is not closed" on a DynamoDB Stream

I have a DynamoDB table to which I added a Stream. I created a Lambda to process this stream and test throughput, latency, etc. After finishing my tests, I deleted the Lambda's trigger.
Then I proceeded to test the same table with the Python MultiLangDaemon client, to compare and also to see if it could pick up where the Lambda left off.
The daemon starts processing shards and blows up with the exception below. Searching for it, I only found this answer, which did not apply. I tried deleting the DynamoDB table used to track the worker and having the MultiLangDaemon recreate it, as sketched below. The same thing happened.
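
(For clarity, this is what I mean by deleting the lease table. The KCL lease table is named after the daemon's configured application name; the name below is hypothetical. AWS SDK for Java v1:)

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;

public class DeleteLeaseTable {
    public static void main(String[] args) {
        AmazonDynamoDB ddb = AmazonDynamoDBClientBuilder.defaultClient();
        // The MultiLangDaemon recreates this table on its next startup.
        ddb.deleteTable("my-kcl-application-name");
    }
}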
Why does this happen and how can I recover without losing my data in the stream?
SEVERE: Caught exception:
com.amazonaws.services.kinesis.clientlibrary.exceptions.internal.KinesisClientLibIOException: Shard [shardId-00000001500614265247-3b7a2849, shardId-00000001500628985464-c896556e] is not closed. This can happen if we constructed the list of shards while a reshard operation was in progress.
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShardSyncer.assertClosedShardsAreCoveredOrAbsent(ShardSyncer.java:206)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShardSyncer.cleanupLeasesOfFinishedShards(ShardSyncer.java:652)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShardSyncer.syncShardLeases(ShardSyncer.java:141)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShardSyncer.checkAndCreateLeasesForNewShards(ShardSyncer.java:88)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownTask.call(ShutdownTask.java:122)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:49)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:24)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
