I have a DynamoDB table to which I added a Stream. I created a Lambda function to process this stream and test throughput, latency, etc. After finishing my tests I deleted the Lambda's trigger.
Then I proceeded to test the same table with the Python MultiLangDaemon client, to compare the two and to see whether it could pick up where the Lambda left off.
The daemon starts processing shards and blows up with the exception below. Searching for it, I only found this answer, which did not apply. I tried deleting the DynamoDB table used to track the workers and letting the MultiLangDaemon recreate it; the same thing happened.
Why does this happen, and how can I recover without losing the data in the stream?
SEVERE: Caught exception:
com.amazonaws.services.kinesis.clientlibrary.exceptions.internal.KinesisClientLibIOException: Shard [shardId-00000001500614265247-3b7a2849, shardId-00000001500628985464-c896556e] is not closed. This can happen if we constructed the list of shards while a reshard operation was in progress.
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShardSyncer.assertClosedShardsAreCoveredOrAbsent(ShardSyncer.java:206)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShardSyncer.cleanupLeasesOfFinishedShards(ShardSyncer.java:652)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShardSyncer.syncShardLeases(ShardSyncer.java:141)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShardSyncer.checkAndCreateLeasesForNewShards(ShardSyncer.java:88)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownTask.call(ShutdownTask.java:122)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:49)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:24)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I am using Spring Kafka 2.7.8 to consume messages from Kafka.
The consumer listener is shown below:
@KafkaListener(topics = "topicName",
        groupId = "groupId",
        containerFactory = "kafkaListenerFactory")
public void onMessage(ConsumerRecord record) {
}
The onMessage method above receives a single message at a time.
Does this mean max.poll.records is set to 1 by the Spring library, or does it poll 500 records at a time (the default value) and the method receives them one by one?
The reason for this question is that we often see the errors below together in production.
We received all four errors below for multiple consumers in under a minute.
We are trying to understand whether this is caused by intermittent Kafka broker connectivity issues or by load. Please advise.
Seek to current after exception; nested exception is org.apache.kafka.clients.consumer.CommitFailedException: Offset commit cannot be completed since the consumer is not part of an active group for auto partition assignment; it is likely that the consumer was kicked out of the group.
seek to current after exception; nested exception is org.apache.kafka.common.errors.TimeoutException: Timeout of 60000ms expired before successfully committing offsets {topic-9=OffsetAndMetadata{offset=2729058, leaderEpoch=null, metadata=''}}
Consumer clientId=consumer-groupName-5, groupId=consumer] Offset commit failed on partition topic-33 at offset 2729191: The coordinator is not aware of this member.
Seek to current after exception; nested exception is org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records
max.poll.records is not changed by Spring; it will take the default (or whatever you set it to). The records are handed to the listener one at a time before the next poll.
This means that your listener must be able to process max.poll.records within max.poll.interval.ms.
You need to reduce max.poll.records and/or increase max.poll.interval.ms so that you can process the records in that time, with a generous margin, to avoid these rebalances.
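For reference, a minimal sketch of where those two properties live in a Spring Kafka configuration; the bootstrap server, bean names, and concrete values below are placeholders, not recommendations:

import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.core.DefaultKafkaConsumerFactory;

@Configuration
public class KafkaConsumerConfig {

    @Bean
    public ConsumerFactory<String, String> consumerFactory() {
        Map<String, Object> props = new HashMap<>();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "groupId");
        // Fewer records per poll ...
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 100);
        // ... and/or more time to process each poll before the broker evicts the consumer.
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 600000);
        return new DefaultKafkaConsumerFactory<>(props);
    }

    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerFactory(
            ConsumerFactory<String, String> consumerFactory) {
        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory);
        return factory;
    }
}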
The Hazelcast cluster runs in an application running on Kubernetes. I can't see any traces of partitioning or other problems in the logs. At some point, this exception starts to appear in the logs:
hz.dazzling_morse.partition-operation.thread-1 com.hazelcast.logging.StandardLoggerFactory$StandardLogger: app-name, , , , , - [172.30.67.142]:5701 [app-name] [4.1.5] Executor is shut down.
java.util.concurrent.RejectedExecutionException: Executor is shut down.
at com.hazelcast.scheduledexecutor.impl.operations.AbstractSchedulerOperation.checkNotShutdown(AbstractSchedulerOperation.java:73)
at com.hazelcast.scheduledexecutor.impl.operations.AbstractSchedulerOperation.getContainer(AbstractSchedulerOperation.java:65)
at com.hazelcast.scheduledexecutor.impl.operations.SyncBackupStateOperation.run(SyncBackupStateOperation.java:39)
at com.hazelcast.spi.impl.operationservice.Operation.call(Operation.java:184)
at com.hazelcast.spi.impl.operationexecutor.OperationRunner.runDirect(OperationRunner.java:150)
at com.hazelcast.spi.impl.operationservice.impl.operations.Backup.run(Backup.java:174)
at com.hazelcast.spi.impl.operationservice.Operation.call(Operation.java:184)
at com.hazelcast.spi.impl.operationservice.impl.OperationRunnerImpl.call(OperationRunnerImpl.java:256)
at com.hazelcast.spi.impl.operationservice.impl.OperationRunnerImpl.run(OperationRunnerImpl.java:237)
at com.hazelcast.spi.impl.operationservice.impl.OperationRunnerImpl.run(OperationRunnerImpl.java:452)
at com.hazelcast.spi.impl.operationexecutor.impl.OperationThread.process(OperationThread.java:166)
at com.hazelcast.spi.impl.operationexecutor.impl.OperationThread.process(OperationThread.java:136)
at com.hazelcast.spi.impl.operationexecutor.impl.OperationThread.executeRun(OperationThread.java:123)
at com.hazelcast.internal.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:102)
I can't see any particular operation failing prior to that. I do run some scheduled operations myself, but they execute inside try-catch blocks and never throw.
The consequence is that whenever a node in the cluster restarts, no data is replicated to the new node, which eventually renders the entire cluster useless: all the data that's supposed to be cached and replicated among the nodes disappears.
What could be the cause? How can I get more details about what causes whichever executor Hazelcast uses to shut down?
Based on other conversations...
Your Runnable / Callable should implement HazelcastInstanceAware.
Don't pass the HazelcastInstance or IExecutorService as a non-transient argument... as the instance where the runnable is submitted will be different from the one where it runs.
See this.
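A minimal sketch of that pattern, assuming a Callable submitted to an IExecutorService (the class name and map name are made up):

import java.io.Serializable;
import java.util.concurrent.Callable;

import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.HazelcastInstanceAware;

public class CountLocalEntriesTask implements Callable<Integer>, Serializable, HazelcastInstanceAware {

    // transient, so the HazelcastInstance is never serialized along with the task
    private transient HazelcastInstance hazelcastInstance;

    @Override
    public void setHazelcastInstance(HazelcastInstance hazelcastInstance) {
        // injected by Hazelcast on the member that actually runs the task
        this.hazelcastInstance = hazelcastInstance;
    }

    @Override
    public Integer call() {
        // use the locally injected instance, not one captured at submit time
        return hazelcastInstance.getMap("myCache").size();
    }
}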
We have been getting lots of InvalidPidMappingException failures recently.
I came across https://github.com/spring-projects/spring-kafka/issues/660, but that is specific to producer-only publishing in a transaction.
In our case this is happening in consumers, so it cannot be resolved using retries (it keeps failing with each rollback retry).
There's a fix in Kafka to address this exception, but it will probably not be released for another 4 months (v2.5.0). https://github.com/apache/kafka/pull/7115
Is there a way we can recover from these failures?
One option we're considering is to use a ContainerAwareErrorHandler to restart the listener container when one of these two specific exceptions occurs (see the sketch after the stack trace below). Any issues with this approach? What would happen if another listener is processing another message?
org.apache.kafka.common.KafkaException: Cannot execute transactional method because we are in an error state
at org.apache.kafka.clients.producer.internals.TransactionManager.maybeFailWithError(TransactionManager.java:924)
at org.apache.kafka.clients.producer.internals.TransactionManager.beginTransaction(TransactionManager.java:290)
at org.apache.kafka.clients.producer.KafkaProducer.beginTransaction(KafkaProducer.java:642)
at org.springframework.kafka.core.DefaultKafkaProducerFactory$CloseSafeProducer.beginTransaction(DefaultKafkaProducerFactory.java:597)
at org.springframework.kafka.core.ProducerFactoryUtils.getTransactionalResourceHolder(ProducerFactoryUtils.java:101)
at org.springframework.kafka.transaction.KafkaTransactionManager.doBegin(KafkaTransactionManager.java:160)
at com.projectdrgn.common.messaging.config.KafkaProducerConfig$kafkaTransactionManager$kafkaTransactionManager$1.doBegin(KafkaProducerConfig.kt:65)
at org.springframework.transaction.support.AbstractPlatformTransactionManager.getTransaction(AbstractPlatformTransactionManager.java:376)
at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:137)
at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.recordAfterRollback(KafkaMessageListenerContainer.java:1463)
at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.invokeRecordListenerInTx(KafkaMessageListenerContainer.java:1438)
at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.invokeRecordListener(KafkaMessageListenerContainer.java:1398)
at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.invokeListener(KafkaMessageListenerContainer.java:1165)
at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.pollAndInvoke(KafkaMessageListenerContainer.java:949)
at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.run(KafkaMessageListenerContainer.java:884)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.kafka.common.KafkaException: org.apache.kafka.common.errors.InvalidPidMappingException: The producer attempted to use a producer id which is not currently assigned to its transactional id.
at org.apache.kafka.clients.producer.internals.TransactionManager$AddPartitionsToTxnHandler.handleResponse(TransactionManager.java:1204)
at org.apache.kafka.clients.producer.internals.TransactionManager$TxnRequestHandler.onComplete(TransactionManager.java:1069)
at org.apache.kafka.clients.ClientResponse.onComplete(ClientResponse.java:109)
at org.apache.kafka.clients.NetworkClient.completeResponses(NetworkClient.java:561)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:553)
at org.apache.kafka.clients.producer.internals.Sender.runOnce(Sender.java:307)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:238)
... 1 common frames omitted
Caused by: org.apache.kafka.common.errors.InvalidPidMappingException: The producer attempted to use a producer id which is not currently assigned to its transactional id.
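A minimal sketch of the ContainerAwareErrorHandler idea mentioned above; the class name and the cause-detection logic are assumptions, and the restart happens on a separate thread so the consumer thread is not blocked while the container stops:

import java.util.List;

import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.errors.InvalidPidMappingException;
import org.springframework.kafka.listener.ContainerAwareErrorHandler;
import org.springframework.kafka.listener.MessageListenerContainer;

public class RestartContainerErrorHandler implements ContainerAwareErrorHandler {

    @Override
    public void handle(Exception thrownException, List<ConsumerRecord<?, ?>> records,
            Consumer<?, ?> consumer, MessageListenerContainer container) {
        if (hasCause(thrownException, InvalidPidMappingException.class)) {
            // stop the container, then start it again from a separate thread
            container.stop(() -> new Thread(container::start).start());
        }
        else {
            throw new IllegalStateException("Unrecoverable listener error", thrownException);
        }
    }

    private boolean hasCause(Throwable ex, Class<? extends Throwable> type) {
        while (ex != null) {
            if (type.isInstance(ex)) {
                return true;
            }
            ex = ex.getCause();
        }
        return false;
    }
}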
One of my batch jobs failed tonight with a runtime exception. It writes data to Datastore like the roughly 200 other jobs that were running tonight. This one failed with a very long list of causes; the root of it should be this:
Caused by: com.google.datastore.v1.client.DatastoreException: I/O error, code=UNAVAILABLE
at com.google.datastore.v1.client.RemoteRpc.makeException(RemoteRpc.java:126)
at com.google.datastore.v1.client.RemoteRpc.call(RemoteRpc.java:95)
at com.google.datastore.v1.client.Datastore.commit(Datastore.java:84)
at com.google.cloud.dataflow.sdk.io.datastore.DatastoreV1$DatastoreWriterFn.flushBatch(DatastoreV1.java:925)
at com.google.cloud.dataflow.sdk.io.datastore.DatastoreV1$DatastoreWriterFn.processElement(DatastoreV1.java:892)
Caused by: java.io.IOException: insufficient data written
at sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.close(HttpURLConnection.java:3501)
at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:81)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:981)
at com.google.datastore.v1.client.RemoteRpc.call(RemoteRpc.java:87)
at com.google.datastore.v1.client.Datastore.commit(Datastore.java:84)
at com.google.cloud.dataflow.sdk.io.datastore.DatastoreV1$DatastoreWriterFn.flushBatch(DatastoreV1.java:925)
at com.google.cloud.dataflow.sdk.io.datastore.DatastoreV1$DatastoreWriterFn.processElement(DatastoreV1.java:892)
at com.google.cloud.dataflow.sdk.util.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:49)
at com.google.cloud.dataflow.sdk.util.DoFnRunnerBase.processElement(DoFnRunnerBase.java:139)
at com.google.cloud.dataflow.sdk.runners.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:188)
at com.google.cloud.dataflow.sdk.runners.worker.ForwardingParDoFn.processElement(ForwardingParDoFn.java:42)
at com.google.cloud.dataflow.sdk.runners.
How can this happen? It's very similar to all the other jobs I run. I am using Dataflow SDK version 1.9.0 and the standard DatastoreIO.v1().write....
The jobIds with this error message:
2017-08-29_17_05_19-6961364220840664744
2017-08-29_16_40_46-15665765683196208095
Is it possible to retrieve the errors/logs of a job from an outside application (not the Cloud Console) so jobs can automatically be restarted when they fail for temporary reasons, such as quota issues, but would usually succeed?
Thanks in advance
This is most likely because DatastoreIO is trying to write more mutations in one RPC call than the Datastore RPC size limit allows. This is data-dependent; I suppose the data for this job differs somewhat from the data for other jobs. In any case, this issue was fixed in 2.1.0, so updating your SDK should help.
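For illustration, a rough sketch of the equivalent write with the 2.x SDK (Apache Beam packages); the project id and the entity-producing step are placeholders:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.datastore.DatastoreIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;
import com.google.datastore.v1.Entity;

public class DatastoreWritePipeline {
    public static void main(String[] args) {
        Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // buildEntities() stands in for whatever transforms the job uses to produce entities.
        PCollection<Entity> entities = buildEntities(pipeline);

        // In SDK 2.1.0+ the writer splits batches so they stay within the Datastore RPC size limit.
        entities.apply(DatastoreIO.v1().write().withProjectId("my-gcp-project"));

        pipeline.run();
    }

    private static PCollection<Entity> buildEntities(Pipeline pipeline) {
        // placeholder: job-specific entity creation goes here
        throw new UnsupportedOperationException("job-specific entity creation");
    }
}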
We are trying to create a huge graph (around 100,000 vertices) in Titan using a Gremlin Server remote connection. We have followed the example code available at https://github.com/pluradj/titan-tp3-driver-example to create the remote connection to Titan via Gremlin Server. We are able to create indices, vertices, and edges, and to query the simple graphs we created without any problem.
However, when we try to create a huge graph using a generator (which creates vertices and edges directly on the server over the established remote connection), we are getting the following error:
6041316 [gremlin-server-exec-6] WARN org.apache.tinkerpop.gremlin.server.op.AbstractEvalOpProcessor - Exception processing a script on request [RequestMessage{, requestId=81f949ad-0e37-4293-bcaa-0714cb159c3b, op='eval', processor='', args={gremlin=g.V().has('idObj', 'OC97').next().addEdge('OC_LC', g.V().has('idObj', 'LC9643').next()), batchSize=64}}].
java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.codehaus.groovy.reflection.CachedClass$3.initValue(CachedClass.java:106)
at org.codehaus.groovy.reflection.CachedClass$3.initValue(CachedClass.java:84)
at org.codehaus.groovy.util.LazyReference.getLocked(LazyReference.java:49)
at org.codehaus.groovy.util.LazyReference.get(LazyReference.java:36)
at org.codehaus.groovy.reflection.CachedClass.getMethods(CachedClass.java:260)
at groovy.lang.MetaClassImpl.addInterfaceMethods(MetaClassImpl.java:419)
at groovy.lang.MetaClassImpl.fillMethodIndex(MetaClassImpl.java:342)
at groovy.lang.MetaClassImpl.initialize(MetaClassImpl.java:3264)
at org.codehaus.groovy.reflection.ClassInfo.getMetaClassUnderLock(ClassInfo.java:254)
at org.codehaus.groovy.reflection.ClassInfo.getMetaClass(ClassInfo.java:285)
at org.codehaus.groovy.reflection.ClassInfo.getMetaClass(ClassInfo.java:295)
at org.codehaus.groovy.runtime.metaclass.MetaClassRegistryImpl.getMetaClass(MetaClassRegistryImpl.java:261)
at org.codehaus.groovy.runtime.InvokerHelper.getMetaClass(InvokerHelper.java:873)
at org.codehaus.groovy.runtime.callsite.CallSiteArray.createPojoSite(CallSiteArray.java:125)
at org.codehaus.groovy.runtime.callsite.CallSiteArray.createCallSite(CallSiteArray.java:166)
at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:113)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:133)
at Script72559.run(Script72559.groovy:1)
at org.apache.tinkerpop.gremlin.groovy.jsr223.GremlinGroovyScriptEngine.eval(GremlinGroovyScriptEngine.java:534)
at org.apache.tinkerpop.gremlin.groovy.jsr223.GremlinGroovyScriptEngine.eval(GremlinGroovyScriptEngine.java:374)
at javax.script.AbstractScriptEngine.eval(AbstractScriptEngine.java:233)
at org.apache.tinkerpop.gremlin.groovy.engine.ScriptEngines.eval(ScriptEngines.java:102)
at org.apache.tinkerpop.gremlin.groovy.engine.GremlinExecutor.lambda$eval$0(GremlinExecutor.java:258)
at org.apache.tinkerpop.gremlin.groovy.engine.GremlinExecutor$$Lambda$137/1500273035.call(Unknown Source)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
The graph generation is fast in the beginning, slows down gradually, and fails at around 31,000 vertices with the above error.
We have tried changing the default cache parameters as below
cache.db-cache=true
cache.db-cache-clean-wait=0
cache.db-cache-time=10000
cache.db-cache-size=0.1
We have also tried disabling the cache by setting cache.db-cache=false, but none of these steps worked for us.
Our environment:
CDH 5.7.1
Titan 1.1.0-SNAPSHOT
Solr 4.10.3
HBase 1.2.0
Could you please guide us on how to overcome this problem?
The problem was forgetting to use Parameterized Scripts: http://tinkerpop.apache.org/docs/current/reference/#parameterized-scripts
Gremlin Server caches every script that is passed to it. With Parameterized Scripts the same script text (e.g. g.V(x)) is reused across requests, so only that one common script ends up in the cache.
Not using Parameterized Scripts and building each script with String.format instead (like we did) means every Gremlin script is cached separately, which is very expensive and eventually causes an OutOfMemoryError.
I hope this helps someone ;)
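A minimal sketch of what that looks like with the TinkerPop Java driver (the host is a placeholder; the property key and ids match the log line above, the rest is assumed):

import java.util.HashMap;
import java.util.Map;

import org.apache.tinkerpop.gremlin.driver.Client;
import org.apache.tinkerpop.gremlin.driver.Cluster;

public class ParameterizedEdgeLoader {
    public static void main(String[] args) {
        Cluster cluster = Cluster.build().addContactPoint("localhost").create();
        Client client = cluster.connect();

        // The script text stays constant, so Gremlin Server compiles and caches it once;
        // only the bindings change per request.
        String script = "g.V().has('idObj', fromId).next()"
                + ".addEdge('OC_LC', g.V().has('idObj', toId).next())";

        Map<String, Object> bindings = new HashMap<>();
        bindings.put("fromId", "OC97");
        bindings.put("toId", "LC9643");
        client.submit(script, bindings).all().join();

        client.close();
        cluster.close();
    }
}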