How can I solve this problem loading Grakn schema and data at localhost:48555? - vaticle-typedb

I'm using Grakn Core 1.8.4 on Windows 10. The Grakn server and Grakn storage start up normally, but when I try to load a schema, Grakn returns the following error message:
Unable to create connection to Grakn instance at localhost:48555
Cause: io.grpc.StatusRuntimeException
UNKNOWN: ID block allocation on partition(30)-namespace(0) timed out in 2.000 min. Please check server logs for the stack trace.
I already checked and there are no other processes listening to the same port. I also disabled the Firewall and that did not solve the problem. Does anyone have any indication of how I should proceed?
Below is part of the log:
2021-01-05 14:19:24,387 [JanusGraphID(30)(0)[0]] WARN g.c.g.d.i.ConsistentKeyIDAuthority - Temporary storage exception while acquiring id block - retrying in PT0.32S: {}
grakn.core.graph.diskstorage.TemporaryBackendException: Wrote claim for id block [1, 10001) in PT0.016S => too slow, threshold is: PT0.01S
at grakn.core.graph.diskstorage.idmanagement.ConsistentKeyIDAuthority.getIDBlock(ConsistentKeyIDAuthority.java:320)
at grakn.core.graph.graphdb.database.idassigner.StandardIDPool$IDBlockGetter.call(StandardIDPool.java:262)
at grakn.core.graph.graphdb.database.idassigner.StandardIDPool$IDBlockGetter.call(StandardIDPool.java:232)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
2021-01-05 14:19:24,688 [grpc-request-handler-1] ERROR grakn.core.server.rpc.SessionService - An error has occurred
grakn.core.graph.core.JanusGraphException: ID block allocation on partition(30)-namespace(0) timed out in 2.000 min
at grakn.core.graph.graphdb.database.idassigner.StandardIDPool.waitForIDBlockGetter(StandardIDPool.java:146)
at grakn.core.graph.graphdb.database.idassigner.StandardIDPool.nextBlock(StandardIDPool.java:165)
at grakn.core.graph.graphdb.database.idassigner.StandardIDPool.nextID(StandardIDPool.java:185)
at grakn.core.graph.graphdb.database.idassigner.VertexIDAssigner.assignID(VertexIDAssigner.java:334)
at grakn.core.graph.graphdb.database.idassigner.VertexIDAssigner.assignID(VertexIDAssigner.java:184)
at grakn.core.graph.graphdb.database.idassigner.VertexIDAssigner.assignID(VertexIDAssigner.java:154)
at grakn.core.graph.graphdb.database.StandardJanusGraph.assignID(StandardJanusGraph.java:416)
at grakn.core.graph.graphdb.transaction.StandardJanusGraphTx.addVertex(StandardJanusGraphTx.java:636)
at grakn.core.graph.graphdb.transaction.StandardJanusGraphTx.addVertex(StandardJanusGraphTx.java:653)
at grakn.core.graph.graphdb.transaction.StandardJanusGraphTx.addVertex(StandardJanusGraphTx.java:649)
at grakn.core.concept.structure.ElementFactory.addVertexElement(ElementFactory.java:104)
at grakn.core.concept.manager.ConceptManagerImpl.addTypeVertex(ConceptManagerImpl.java:188)
at grakn.core.server.session.TransactionImpl.createMetaConcepts(TransactionImpl.java:1297)
at grakn.core.server.session.SessionImpl.initialiseMetaConcepts(SessionImpl.java:123)
at grakn.core.server.session.SessionImpl.<init>(SessionImpl.java:85)
at grakn.core.server.session.SessionFactory.session(SessionFactory.java:115)
at grakn.core.server.rpc.ServerOpenRequest.open(ServerOpenRequest.java:40)
at grakn.core.server.rpc.SessionService.open(SessionService.java:122)
at grakn.protocol.session.SessionServiceGrpc$MethodHandlers.invoke(SessionServiceGrpc.java:339)
at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:172)
at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)
at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:814)
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: java.util.concurrent.TimeoutException: null
at java.util.concurrent.FutureTask.get(Unknown Source)
at grakn.core.graph.graphdb.database.idassigner.StandardIDPool.waitForIDBlockGetter(StandardIDPool.java:126)
... 26 common frames omitted

If you get this error, you can try to increase the wait time with the following two parameters in the grakn.properties file:
# The number of milliseconds that the JanusGraph id pool manager will wait before
# giving up on allocating a new block of ids
root.ids.renew-timeout=10
# The number of milliseconds the system waits for an ID block reservation to be acknowledged by the storage backend
ids.authority.wait-time=10

I solved this by increasing the parameters below from 10 to 100 in the .\grakn\server\conf\grakn.properties configuration file.
I'm not sure whether this affects performance.
# The number of milliseconds that the JanusGraph id pool manager will wait before
# giving up on allocating a new block of ids
root.ids.renew-timeout=100
# The number of milliseconds the system waits for an ID block reservation to be acknowledged by the storage backend
ids.authority.wait-time=100
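
The WARN line in the log reports that the id block claim was written in PT0.016S against a threshold of PT0.01S, that is, 16 ms against the 10 ms wait time. The following is a minimal sketch of that arithmetic, assuming the check is a plain "measured duration greater than configured wait time" comparison; the class and method names are hypothetical and are not part of Grakn or JanusGraph.

```java
import java.time.Duration;

// Hypothetical helper: compares a duration copied from the log with a wait time in ms.
final class IdBlockTiming {

    /** True when the measured claim write took longer than the configured wait time. */
    static boolean tooSlow(String measuredIso, long waitTimeMillis) {
        Duration measured = Duration.parse(measuredIso); // ISO-8601, e.g. "PT0.016S" is 16 ms
        return measured.compareTo(Duration.ofMillis(waitTimeMillis)) > 0;
    }

    public static void main(String[] args) {
        String measured = "PT0.016S"; // value copied from the WARN line in the log
        for (long waitTime : new long[] {10, 100}) {
            System.out.println("ids.authority.wait-time=" + waitTime
                    + " -> too slow: " + tooSlow(measured, waitTime));
        }
    }
}
```

Under that assumption, a 16 ms write exceeds a wait time of 10 but not one of 100, which is consistent with the fix reported above.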


wso2 apimanager Active-Active deployment

I have deployed API Manager 4.0.0 All-in-one on 2 VMs, fronted by a load balancer.
When one node is shut down with the command "sh api-manager.sh stop", the other node takes over successfully and runs well, but there are some errors in the console like the ones below:
TID: [-1] [] [2022-03-14 10:14:13,270] ERROR {org.wso2.carbon.databridge.agent.endpoint.DataEndpointConnectionWorker} - Error while trying to connect to the endpoint. Cannot borrow client for ssl://10.32.73.10:9711
org.wso2.carbon.databridge.agent.exception.DataEndpointAuthenticationException: Cannot borrow client for ssl://10.32.73.10:9711
at org.wso2.carbon.databridge.agent.endpoint.DataEndpointConnectionWorker.connect(DataEndpointConnectionWorker.java:147)
at org.wso2.carbon.databridge.agent.endpoint.DataEndpointConnectionWorker.run(DataEndpointConnectionWorker.java:59)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.wso2.carbon.databridge.agent.exception.DataEndpointException: Error while opening socket to 10.32.73.10:9711. Connection refused (Connection refused)
at org.wso2.carbon.databridge.agent.endpoint.binary.BinarySecureClientPoolFactory.createClient(BinarySecureClientPoolFactory.java:75)
at org.wso2.carbon.databridge.agent.client.AbstractClientPoolFactory.makeObject(AbstractClientPoolFactory.java:39)
at org.apache.commons.pool.impl.GenericKeyedObjectPool.borrowObject(GenericKeyedObjectPool.java:1212)
at org.wso2.carbon.databridge.agent.endpoint.DataEndpointConnectionWorker.connect(DataEndpointConnectionWorker.java:137)
... 6 more
Caused by: java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:476)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:218)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:200)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:394)
at java.net.Socket.connect(Socket.java:606)
at sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:287)
at sun.security.ssl.SSLSocketImpl.<init>(SSLSocketImpl.java:146)
at sun.security.ssl.SSLSocketFactoryImpl.createSocket(SSLSocketFactoryImpl.java:88)
at org.wso2.carbon.databridge.agent.endpoint.binary.BinarySecureClientPoolFactory.createClient(BinarySecureClientPoolFactory.java:58)
... 9 more
TID: [-1] [] [2022-03-14 10:14:15,158] WARN {org.wso2.carbon.databridge.agent.endpoint.DataEndpointGroup} - No receiver is reachable at URL Endpoint/Endpoints [tcp://10.32.73.10:9611], will try to reconnect every 30 sec
Is there anything wrong in the deployment.toml?
In an APIM active-active setup, each node publishes throttling data to itself and to the other node. When you stop the other node, the running node can no longer publish throttling data to it, hence the connection refused errors, which are expected. These error logs are harmless, and publishing recovers when the other node is started. If you look at the deployment.toml, you can find the other node's details under the throttling configurations.
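
To confirm that the errors simply track whether the peer node is up, one option is to probe the two ports that appear in the log (9611 and 9711). The following is a minimal sketch using only the Java standard library; the class name and the default peer address (taken from the log above) are illustrative assumptions, not part of WSO2.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical diagnostic helper, not part of WSO2 API Manager.
final class PortProbe {

    /** Returns true if a TCP connection to host:port can be opened within timeoutMillis. */
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // "Connection refused", unknown hosts, and timeouts all end up here.
            return false;
        }
    }

    public static void main(String[] args) {
        // Peer address defaults to the IP seen in the log; pass another host as the first argument.
        String peer = args.length > 0 ? args[0] : "10.32.73.10";
        for (int port : new int[] {9611, 9711}) {
            System.out.println(peer + ":" + port + " reachable: " + isReachable(peer, port, 2000));
        }
    }
}
```

If both ports report unreachable only while the other node is stopped, that matches the expected behaviour described above.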

apache ignite out of memory exception

I got an out of memory exception and Ignite crashed. Going through the Ignite logs, the last metrics show heap and off-heap memory usage of about 171 MB and 70 MB respectively, and about 10 seconds later the logs show the out of memory exception. The other values in the metrics also look OK.
Below is the log snippet:
[01:04:29,690][INFO][grid-timeout-worker-#22][IgniteKernal]
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
^-- Node [id=8a034404, uptime=39 days, 15:50:23.086]
^-- Cluster [hosts=1, CPUs=4, servers=1, clients=1, topVer=22, minorTopVer=0]
^-- Network [addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.17.0.1, 172.28.230.222], discoPort=47500, commPort=47100]
^-- CPU [CPUs=4, curLoad=0.07%, avgLoad=0.15%, GC=0%]
^-- Heap [used=171MB, free=95.15%, comm=254MB]
^-- Off-heap memory [used=70MB, free=98.02%, allocated=3377MB]
^-- Page memory [pages=17878]
^-- sysMemPlc region [type=internal, persistence=true, lazyAlloc=false,
... initCfg=40MB, maxCfg=100MB, usedRam=0MB, freeRam=99.98%, allocRam=100MB, allocTotal=0MB]
^-- default region [type=default, persistence=true, lazyAlloc=true,
... initCfg=256MB, maxCfg=3177MB, usedRam=70MB, freeRam=97.78%, allocRam=3177MB, allocTotal=69MB]
^-- metastoreMemPlc region [type=internal, persistence=true, lazyAlloc=false,
... initCfg=40MB, maxCfg=100MB, usedRam=0MB, freeRam=99.95%, allocRam=0MB, allocTotal=0MB]
^-- TxLog region [type=internal, persistence=true, lazyAlloc=false,
... initCfg=40MB, maxCfg=100MB, usedRam=0MB, freeRam=100%, allocRam=100MB, allocTotal=0MB]
^-- volatileDsMemPlc region [type=user, persistence=false, lazyAlloc=true,
... initCfg=40MB, maxCfg=100MB, usedRam=0MB, freeRam=100%, allocRam=0MB]
^-- Ignite persistence [used=69MB]
^-- Outbound messages queue [size=0]
^-- Public thread pool [active=0, idle=0, qSize=0]
^-- System thread pool [active=0, idle=7, qSize=0]
^-- Striped thread pool [active=0, idle=8, qSize=0]
[01:04:38,584][INFO][db-checkpoint-thread-#104][Checkpointer] Checkpoint started [checkpointId=41e99f38-7359-4af1-945f-61c92d2a5fb7, startPtr=WALPointer [idx=147, fileOff=11684440, len=381549], checkpointBeforeLockTime=9ms, checkpointLockWait=0ms, checkpointListenersExecuteTime=17ms, checkpointLockHoldTime=19ms, walCpRecordFsyncDuration=2ms, writeCheckpointEntryDuration=2ms, splitAndSortCpPagesDuration=0ms, pages=9, reason='timeout']
[01:04:38,619][SEVERE][db-checkpoint-thread-#104][] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=CRITICAL_ERROR, err=class o.a.i.IgniteCheckedException: Compound exception for CountDownFuture.]]
class org.apache.ignite.IgniteCheckedException: Compound exception for CountDownFuture.
at org.apache.ignite.internal.util.future.CountDownFuture.addError(CountDownFuture.java:72)
at org.apache.ignite.internal.util.future.CountDownFuture.onDone(CountDownFuture.java:46)
at org.apache.ignite.internal.util.future.CountDownFuture.onDone(CountDownFuture.java:28)
at org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:478)
at org.apache.ignite.internal.processors.cache.persistence.checkpoint.CheckpointPagesWriter.run(CheckpointPagesWriter.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Suppressed: java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.addWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.execute(Unknown Source)
at sun.nio.ch.SimpleAsynchronousFileChannelImpl.implWrite(Unknown Source)
at sun.nio.ch.AsynchronousFileChannelImpl.write(Unknown Source)
at org.apache.ignite.internal.processors.cache.persistence.file.AsyncFileIO.write(AsyncFileIO.java:177)
at org.apache.ignite.internal.processors.cache.persistence.file.AbstractFileIO$5.run(AbstractFileIO.java:117)
at org.apache.ignite.internal.processors.cache.persistence.file.AbstractFileIO.fully(AbstractFileIO.java:53)
at org.apache.ignite.internal.processors.cache.persistence.file.AbstractFileIO.writeFully(AbstractFileIO.java:115)
at org.apache.ignite.internal.processors.cache.persistence.file.FilePageStore.write(FilePageStore.java:748)
at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageReadWriteManagerImpl.write(PageReadWriteManagerImpl.java:116)
at org.apache.ignite.internal.processors.cache.persistence.file.FilePageStoreManager.write(FilePageStoreManager.java:636)
at org.apache.ignite.internal.processors.cache.persistence.checkpoint.CheckpointManager.lambda$new$0(CheckpointManager.java:175)
at org.apache.ignite.internal.processors.cache.persistence.checkpoint.CheckpointPagesWriter$1.writePage(CheckpointPagesWriter.java:266)
at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.copyPageForCheckpoint(PageMemoryImpl.java:1343)
at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.checkpointWritePage(PageMemoryImpl.java:1250)
at org.apache.ignite.internal.processors.cache.persistence.checkpoint.CheckpointPagesWriter.writePages(CheckpointPagesWriter.java:207)
at org.apache.ignite.internal.processors.cache.persistence.checkpoint.CheckpointPagesWriter.run(CheckpointPagesWriter.java:151)
... 3 more
[01:04:38,620][SEVERE][db-checkpoint-thread-#104][FailureProcessor] No deadlocked threads detected.
[01:04:38,749][SEVERE][db-checkpoint-thread-#104][FailureProcessor] Thread dump at 2022/02/06 01:04:38 CST
unable to create new native thread
This seems to be a non-Ignite exception and most likely is about your system configuration.
Check your process limits by running the ulimit -a command and increase them if required. The recommended value is 32768 or above. If an adjustment is required, it can be made either by running ulimit -n 32768 -u 32768 (open files and user processes, respectively) or by modifying /etc/security/limits.conf.
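
Because the error concerns native threads rather than heap, it can also help to see how many threads the JVM itself is holding and compare that with the "max user processes" value from ulimit -a. The following is a minimal sketch using the standard ThreadMXBean; the class name is hypothetical, and the numbers are only meaningful when the methods are called from inside the JVM that runs Ignite (or read over JMX), not from a separate process.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical helper: reports thread counts of the JVM it runs in.
final class ThreadUsage {

    private static final ThreadMXBean THREADS = ManagementFactory.getThreadMXBean();

    /** Number of live threads (daemon and non-daemon) in this JVM right now. */
    static int liveThreads() {
        return THREADS.getThreadCount();
    }

    /** Highest live thread count observed since the JVM started or the peak was reset. */
    static int peakThreads() {
        return THREADS.getPeakThreadCount();
    }

    public static void main(String[] args) {
        System.out.println("live threads: " + liveThreads());
        System.out.println("peak threads: " + peakThreads());
    }
}
```

The user process limit applies to the whole OS user, so other processes running under the same account also count toward it.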

AcquireJobsRunnableImpl throws PSQLException: SSL error: readHandshakeRecord

After successfully migrating an Alfresco repository from 5.2 to 7.0, I get PSQLException: SSL error: readHandshakeRecord (see the stack trace below) every morning at 4:00, which then causes the repository to stop responding.
Could someone please help me decipher this stack trace? Why is this job running around 4 am? I can't find a matching Quartz job. Does anyone know how to trigger this call manually so that I can reproduce and fix the problem? At first I thought it might be related to the contentStoreCleaner running at 4:00 am, but disabling that job doesn't change anything.
The only workaround I have found so far is to disable the Activiti workflow engine.
2021-08-08 04:35:39,396 ERROR [org.activiti.engine.impl.jobexecutor.AcquireJobsRunnableImpl] [Thread-46] exception during job acquisition: Could not open JDBC Connection for transaction; nested exception is org.postgresql.util.PSQLException: SSL error: readHandshakeRecord
org.springframework.transaction.CannotCreateTransactionException: Could not open JDBC Connection for transaction; nested exception is org.postgresql.util.PSQLException: SSL error: readHandshakeRecord
at org.springframework.jdbc.datasource.DataSourceTransactionManager.doBegin(DataSourceTransactionManager.java:309)
at org.springframework.transaction.support.AbstractPlatformTransactionManager.startTransaction(AbstractPlatformTransactionManager.java:400)
at org.springframework.transaction.support.AbstractPlatformTransactionManager.getTransaction(AbstractPlatformTransactionManager.java:373)
at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:137)
at org.activiti.spring.SpringTransactionInterceptor.execute(SpringTransactionInterceptor.java:45)
at org.activiti.engine.impl.interceptor.LogInterceptor.execute(LogInterceptor.java:31)
at org.activiti.engine.impl.cfg.CommandExecutorImpl.execute(CommandExecutorImpl.java:40)
at org.activiti.engine.impl.cfg.CommandExecutorImpl.execute(CommandExecutorImpl.java:35)
at org.activiti.engine.impl.jobexecutor.AcquireJobsRunnableImpl.run(AcquireJobsRunnableImpl.java:54)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.postgresql.util.PSQLException: SSL error: readHandshakeRecord
at org.postgresql.ssl.MakeSSL.convert(MakeSSL.java:43)
at org.postgresql.core.v3.ConnectionFactoryImpl.enableSSL(ConnectionFactoryImpl.java:534)
at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:149)
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:213)
at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:51)
at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:223)
at org.postgresql.Driver.makeConnection(Driver.java:465)
at org.postgresql.Driver.connect(Driver.java:264)
at org.apache.commons.dbcp.DriverConnectionFactory.createConnection(DriverConnectionFactory.java:38)
at org.apache.commons.dbcp.PoolableConnectionFactory.makeObject(PoolableConnectionFactory.java:582)
at org.apache.commons.pool.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:1188)
at org.apache.commons.dbcp.PoolingDataSource.getConnection(PoolingDataSource.java:106)
at org.apache.commons.dbcp.BasicDataSource.getConnection(BasicDataSource.java:1044)
at org.springframework.jdbc.datasource.DataSourceTransactionManager.doBegin(DataSourceTransactionManager.java:265)
... 9 more
Caused by: javax.net.ssl.SSLException: readHandshakeRecord
at java.base/sun.security.ssl.SSLSocketImpl.readHandshakeRecord(SSLSocketImpl.java:1335)
at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:440)
at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:411)
at org.postgresql.ssl.MakeSSL.convert(MakeSSL.java:41)
... 22 more
Suppressed: java.net.SocketException: Broken pipe (Write failed)
at java.base/java.net.SocketOutputStream.socketWrite0(Native Method)
at java.base/java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:110)
at java.base/java.net.SocketOutputStream.write(SocketOutputStream.java:150)
at java.base/sun.security.ssl.SSLSocketOutputRecord.encodeAlert(SSLSocketOutputRecord.java:81)
at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:380)
at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:292)
at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:450)
... 24 more
Caused by: java.net.SocketException: Broken pipe (Write failed)
at java.base/java.net.SocketOutputStream.socketWrite0(Native Method)
at java.base/java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:110)
at java.base/java.net.SocketOutputStream.write(SocketOutputStream.java:150)
at java.base/sun.security.ssl.SSLSocketOutputRecord.flush(SSLSocketOutputRecord.java:251)
at java.base/sun.security.ssl.HandshakeOutStream.flush(HandshakeOutStream.java:89)
at java.base/sun.security.ssl.Finished$T13FinishedProducer.onProduceFinished(Finished.java:679)
at java.base/sun.security.ssl.Finished$T13FinishedProducer.produce(Finished.java:658)
at java.base/sun.security.ssl.SSLHandshake.produce(SSLHandshake.java:436)
at java.base/sun.security.ssl.Finished$T13FinishedConsumer.onConsumeFinished(Finished.java:1011)
at java.base/sun.security.ssl.Finished$T13FinishedConsumer.consume(Finished.java:874)
at java.base/sun.security.ssl.SSLHandshake.consume(SSLHandshake.java:392)
at java.base/sun.security.ssl.HandshakeContext.dispatch(HandshakeContext.java:443)
at java.base/sun.security.ssl.HandshakeContext.dispatch(HandshakeContext.java:421)
at java.base/sun.security.ssl.TransportContext.dispatch(TransportContext.java:182)
at java.base/sun.security.ssl.SSLTransport.decode(SSLTransport.java:171)
at java.base/sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1418)
at java.base/sun.security.ssl.SSLSocketImpl.readHandshakeRecord(SSLSocketImpl.java:1324)
... 25 more
The stack trace was misleading: the JDBC connection problem was caused by the Alfresco trashcan cleaner module filling up memory (OutOfMemoryError: Java heap space, due to no limit on getChildAssocs).
We have ~4 million nodes in the trashcan, and on every batch run the module retrieves all nodes again and again until memory is exhausted.

SolrCloud with Zookeeper - cancel_stream_error & TimeoutException: Idle timeout expired: 120000/120000 ms

I have a SolrCloud setup in Kubernetes with 2 Solr instances, 3 ZooKeeper instances, and 1 shard. Each Solr and ZooKeeper instance is configured with 8 GB of persistent storage. Solr is allocated 16 GB of memory with a 10 GB heap. At most 2.5 million records are indexed. A scheduler client calls Solr at /update/json?wt=json&commit=true to perform add/update/delete operations. Occasionally a huge update/delete of 1 million records occurs; it calls the same API (/update/json?wt=json&commit=true) with 500 documents at a time, from multiple threads. Everything worked fine for 1 week, but then we suddenly saw errors in solr.log that put Solr into an error state, and I had to restart one of the Solr nodes. The errors are:
Node 1:
2021-04-09 08:20:56.657 ERROR (updateExecutor-5-thread-169-processing-x:datacore_shard1_replica_n1 r:core_node3 null n:solr-1.solrcluster:8983_solr c:datacore s:shard1) [c:datacore s:shard1 r:core_node3 x:datacore_shard1_replica_n1] o.a.s.u.ErrorReportingConcurrentUpdateSolrClient Error when calling SolrCmdDistributor$Req: cmd=add{,id=S-170262-P-108028200-F-800001737-E-180905508}; node=ForwardNode: http://solr-0.solrcluster:8983/solr/datacore_shard1_replica_n2/ to http://solr-0.solrcluster:8983/solr/datacore_shard1_replica_n2/ => java.io.IOException: java.io.IOException: cancel_stream_error
at org.eclipse.jetty.client.util.DeferredContentProvider.flush(DeferredContentProvider.java:193)
java.io.IOException: java.io.IOException: cancel_stream_error
at org.eclipse.jetty.client.util.DeferredContentProvider.flush(DeferredContentProvider.java:193) ~[?:?]
Node 2:
2021-04-09 08:22:56.661 INFO (qtp1632497828-35124) [c:datacore s:shard1 r:core_node4 x:datacore_shard1_replica_n2] o.a.s.u.p.LogUpdateProcessorFactory [datacore_shard1_replica_n2] webapp=/solr path=/update params={update.distrib=TOLEADER&distrib.from=http://solr-1.solrcluster:8983/solr/datacore_shard1_replica_n1/&wt=javabin&version=2}{} 0 119999
2021-04-09 08:22:56.661 ERROR (qtp1632497828-35124) [c:datacore s:shard1 r:core_node4 x:datacore_shard1_replica_n2] o.a.s.h.RequestHandlerBase java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 120000/120000 ms
at org.eclipse.jetty.server.HttpInput$ErrorState.noContent(HttpInput.java:1085)
at org.eclipse.jetty.server.HttpInput.read(HttpInput.java:318)
On both nodes we also see the error below:
2021-04-09 08:21:00.812 INFO (qtp1632497828-35036) [c:datacore s:shard1 r:core_node4 x:datacore_shard1_replica_n2] o.a.s.u.p.LogUpdateProcessorFactory [datacore_shard1_replica_n2] webapp=/solr path=/update params={update.distrib=TOLEADER&distrib.from=http://solr-1.solrcluster:8983/solr/datacore_shard1_replica_n1/&wt=javabin&version=2}{} 0 120770
2021-04-09 08:21:00.812 ERROR (qtp1632497828-35036) [c:datacore s:shard1 r:core_node4 x:datacore_shard1_replica_n2] o.a.s.h.RequestHandlerBase java.io.IOException: Task queue processing has stalled for 90013 ms with 0 remaining elements to process.
at org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient.blockUntilFinished(ConcurrentUpdateHttp2SolrClient.java:501)
The stall time is set to 90000 ms.
Why are we getting these errors?
Why is it stalling for so long? Our average document size is 1 KB.
How can we resolve this problem?

StatusRuntimeException: UNKNOWN caused by ClosedChannelException

We are seeing these errors on the client side sporadically.
Caused by: io.grpc.StatusRuntimeException: UNKNOWN: channel closed
at io.grpc.Status.asRuntimeException(Status.java:532)
at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:434)
at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
at io.grpc.internal.CensusStatsModule$StatsClientInterceptor$1$1.onClose(CensusStatsModule.java:700)
at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
at io.grpc.internal.CensusTracingModule$TracingClientInterceptor$1$1.onClose(CensusTracingModule.java:398)
at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:459)
at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:63)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:546)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$600(ClientCallImpl.java:467)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:584)
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.nio.channels.ClosedChannelException: null
at io.grpc.netty.Utils.statusFromThrowable(Utils.java:166)
at io.grpc.netty.NettyClientHandler.onConnectionError(NettyClientHandler.java:474)
at io.netty.handler.codec.http2.Http2ConnectionHandler.onError(Http2ConnectionHandler.java:641)
at io.netty.handler.codec.http2.DefaultHttp2ConnectionEncoder.writeHeaders(DefaultHttp2ConnectionEncoder.java:225)
at io.netty.handler.codec.http2.DecoratingHttp2FrameWriter.writeHeaders(DecoratingHttp2FrameWriter.java:53)
at io.netty.handler.codec.http2.StreamBufferingEncoder.writeHeaders(StreamBufferingEncoder.java:157)
at io.netty.handler.codec.http2.StreamBufferingEncoder.writeHeaders(StreamBufferingEncoder.java:141)
at io.grpc.netty.NettyClientHandler.createStream(NettyClientHandler.java:543)
at io.grpc.netty.NettyClientHandler.write(NettyClientHandler.java:312)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:716)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:708)
at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:791)
at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:701)
at io.netty.channel.DefaultChannelPipeline.write(DefaultChannelPipeline.java:1026)
at io.netty.channel.AbstractChannel.write(AbstractChannel.java:288)
at io.grpc.netty.WriteQueue$AbstractQueuedCommand.run(WriteQueue.java:174)
at io.grpc.netty.WriteQueue.flush(WriteQueue.java:112)
at io.grpc.netty.WriteQueue.access$000(WriteQueue.java:32)
at io.grpc.netty.WriteQueue$1.run(WriteQueue.java:44)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:416)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:515)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
... 1 common frames omitted
Caused by: java.nio.channels.ClosedChannelException: null
at io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:955)
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:863)
at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1378)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:716)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:708)
at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:791)
at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:701)
at io.netty.handler.codec.http2.DefaultHttp2FrameWriter.writeHeadersInternal(DefaultHttp2FrameWriter.java:528)
at io.netty.handler.codec.http2.DefaultHttp2FrameWriter.writeHeaders(DefaultHttp2FrameWriter.java:268)
at io.netty.handler.codec.http2.Http2OutboundFrameLogger.writeHeaders(Http2OutboundFrameLogger.java:60)
at io.netty.handler.codec.http2.DefaultHttp2ConnectionEncoder.writeHeaders(DefaultHttp2ConnectionEncoder.java:208)
... 22 common frames omitted
Caused by: java.io.IOException: Connection reset by peer
at java.base/sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at java.base/sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
at java.base/sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:113)
at java.base/sun.nio.ch.IOUtil.write(IOUtil.java:58)
at java.base/sun.nio.ch.IOUtil.write(IOUtil.java:50)
at java.base/sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:466)
at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:405)
at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:928)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:356)
at io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:895)
at io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1383)
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:749)
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:741)
at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:727)
at io.grpc.netty.NettyClientHandler.sendPingFrame(NettyClientHandler.java:646)
at io.grpc.netty.NettyClientHandler.write(NettyClientHandler.java:318)
... 17 common frames omitted
Both the server and the client run in Kubernetes with the server being a ClusterIP service. The client's channel builder looks like this:
NettyChannelBuilder.forAddress("some-service", 8090)
.nameResolverFactory(DnsNameResolverProvider.asFactory())
.defaultLoadBalancingPolicy("round_robin")
.idleTimeout(60000, TimeUnit.MILLISECONDS)
.usePlaintext();
The client sends requests in a burst of 4-5 (concurrently using project-reactor) every 5 minutes. We see failures happening once or twice a day. We have tried setting both keepAlive and idleTimeout on separate occasions but nothing seems to be working. We have retries defined for UNAVAILABLE and DEADLINE_EXCEEDED status codes but not UNKNOWN since that could mean retrying on potentially non-retriable errors. Is there a way to fix this on the client side without having to retry on UNKNOWN errors?
Client gRPC version: 1.20.0
Server gRPC version: 1.20.0
Netty Version: 4.1.39.Final
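
One possible way to keep the retry policy narrow, offered only as a sketch and not as a confirmed fix: rather than retrying on every UNKNOWN status, inspect the cause chain and retry only when it contains java.nio.channels.ClosedChannelException, as in the trace above. The helper below uses only the Java standard library; the class name is hypothetical, and whether a retry is actually safe still depends on the call being idempotent.

```java
import java.nio.channels.ClosedChannelException;

// Hypothetical helper: walks an exception's cause chain looking for a given type.
final class CauseChain {

    /** True if error, or any of its causes (up to 32 levels), is an instance of type. */
    static boolean hasCause(Throwable error, Class<? extends Throwable> type) {
        int depth = 0;
        // The depth cap guards against unusually long or cyclic cause chains.
        for (Throwable t = error; t != null && depth < 32; t = t.getCause(), depth++) {
            if (type.isInstance(t)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Stand-ins for the gRPC exception; a real caller would pass the StatusRuntimeException.
        Throwable closed = new RuntimeException("UNKNOWN: channel closed", new ClosedChannelException());
        Throwable other = new RuntimeException("UNKNOWN");
        System.out.println(hasCause(closed, ClosedChannelException.class));
        System.out.println(hasCause(other, ClosedChannelException.class));
    }
}
```

A retry predicate could then combine the existing UNAVAILABLE and DEADLINE_EXCEEDED checks with hasCause(e, ClosedChannelException.class) for the UNKNOWN case, leaving other UNKNOWN errors non-retriable.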
