SolrCloud with Zookeeper - cancel_stream_error & TimeoutException: Idle timeout expired: 120000/120000 ms - solrcloud

I have a solrCloud setup in Kubernetes with 2 Solr instances and 3 ZooKeeper instances with 1 shard. It is configured with 8G persistent storage for each Solr and Zookeeper. The Memory allocated for Solr is 16G with 10G Heap size. There are a max of 2.5million records indexed. There scheduler client which will call the Solr with url - /update/json?wt=json&commit=true - to do the add/update/delete operations. Occasionally there will be a huge update/delete happens with 1 million records which will call the api (/update/json?wt=json&commit=true ) with 500 documents at a time, but this is called in multiple threads. Everything works fine 1 week, but suddenly we saw errors in Solr.log which makes the solr in an error state and I had to restart one of the solr node. The error are:
Node 1:
021-04-09 08:20:56.657 ERROR (updateExecutor-5-thread-169-processing-x:datacore_shard1_replica_n1 r:core_node3 null n:solr-1.solrcluster:8983_solr c:datacore s:shard1) [c:datacore s:shard1 r:core_node3 x:datacore_shard1_replica_n1] o.a.s.u.ErrorReportingConcurrentUpdateSolrClient Error when calling SolrCmdDistributor$Req: cmd=add{,id=S-170262-P-108028200-F-800001737-E-180905508}; node=ForwardNode: http://solr-0.solrcluster:8983/solr/datacore_shard1_replica_n2/ to http://solr-0.solrcluster:8983/solr/datacore_shard1_replica_n2/ => java.io.IOException: java.io.IOException: cancel_stream_error
at org.eclipse.jetty.client.util.DeferredContentProvider.flush(DeferredContentProvider.java:193)
java.io.IOException: java.io.IOException: cancel_stream_error
at org.eclipse.jetty.client.util.DeferredContentProvider.flush(DeferredContentProvider.java:193) ~[?:?]
Node2:
2021-04-09 08:22:56.661 INFO (qtp1632497828-35124) [c:datacore s:shard1 r:core_node4 x:datacore_shard1_replica_n2] o.a.s.u.p.LogUpdateProcessorFactory [datacore_shard1_replica_n2] webapp=/solr path=/update params={update.distrib=TOLEADER&distrib.from=http://solr-1.solrcluster:8983/solr/datacore_shard1_replica_n1/&wt=javabin&version=2}{} 0 119999
2021-04-09 08:22:56.661 ERROR (qtp1632497828-35124) [c:datacore s:shard1 r:core_node4 x:datacore_shard1_replica_n2] o.a.s.h.RequestHandlerBase java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 120000/120000 ms
at org.eclipse.jetty.server.HttpInput$ErrorState.noContent(HttpInput.java:1085)
at org.eclipse.jetty.server.HttpInput.read(HttpInput.java:318)
And on both nodes we can see the below error as well -
2021-04-09 08:21:00.812 INFO (qtp1632497828-35036) [c:datacore s:shard1 r:core_node4 x:datacore_shard1_replica_n2] o.a.s.u.p.LogUpdateProcessorFactory [datacore_shard1_replica_n2] webapp=/solr path=/update params={update.distrib=TOLEADER&distrib.from=http://solr-1.solrcluster:8983/solr/datacore_shard1_replica_n1/&wt=javabin&version=2}{} 0 120770
2021-04-09 08:21:00.812 ERROR (qtp1632497828-35036) [c:datacore s:shard1 r:core_node4 x:datacore_shard1_replica_n2] o.a.s.h.RequestHandlerBase java.io.IOException: Task queue processing has stalled for 90013 ms with 0 remaining elements to process.
at org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient.blockUntilFinished(ConcurrentUpdateHttp2SolrClient.java:501)
The stall time is set at 90000ms.
Why we are getting these errors?
Why is it stalling for long? We have the average doc size of 1Kb.
How can we resolve this problem?

Related

Mariadb Galera 10.5.13-16 Node Crash

I have a cluster with 2 galera nodes and 1 arbitrator.
My node 1 crashed I don't understand why..
Here is the log of the node 1.
It seems that it is a problem with the pthread library.
Also every requests are proxied by 2 HAProxy.
2023-01-03 12:08:55 0 [Warning] WSREP: Handshake failed: peer did not return a certificate
2023-01-03 12:08:55 0 [Warning] WSREP: Handshake failed: peer did not return a certificate
2023-01-03 12:08:56 0 [Warning] WSREP: Handshake failed: http request
terminate called after throwing an instance of 'boost::wrapexcept<std::system_error>'
what(): remote_endpoint: Transport endpoint is not connected
230103 12:08:56 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Server version: 10.5.13-MariaDB-1:10.5.13+maria~focal
key_buffer_size=134217728
read_buffer_size=2097152
max_used_connections=101
max_threads=102
thread_count=106
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 760333 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x49000
mariadbd(my_print_stacktrace+0x32)[0x55b1b67f7e42]
Printing to addr2line failed
mariadbd(handle_fatal_signal+0x485)[0x55b1b62479a5]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7ff88ea983c0]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7ff88e59e18b]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7ff88e57d859]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911)[0x7ff88e939911]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c)[0x7ff88e94538c]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3f7)[0x7ff88e9453f7]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa6a9)[0x7ff88e9456a9]
/usr/lib/galera/libgalera_smm.so(+0x448ad)[0x7ff884b5e8ad]
/usr/lib/galera/libgalera_smm.so(+0x1fc315)[0x7ff884d16315]
/usr/lib/galera/libgalera_smm.so(+0x1ff7eb)[0x7ff884d197eb]
/usr/lib/galera/libgalera_smm.so(+0x1ffc28)[0x7ff884d19c28]
/usr/lib/galera/libgalera_smm.so(+0x2065b6)[0x7ff884d205b6]
/usr/lib/galera/libgalera_smm.so(+0x1f81f3)[0x7ff884d121f3]
/usr/lib/galera/libgalera_smm.so(+0x1e6f04)[0x7ff884d00f04]
/usr/lib/galera/libgalera_smm.so(+0x103438)[0x7ff884c1d438]
/usr/lib/galera/libgalera_smm.so(+0xe8eea)[0x7ff884c02eea]
/usr/lib/galera/libgalera_smm.so(+0xe9a8d)[0x7ff884c03a8d]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x9609)[0x7ff88ea8c609]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7ff88e67a293]
The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /var/lib/mysql
Resource Limits:
Fatal signal 11 while backtracing
PS: if you want more data ask me :)
OK it seems that 2 simultaneous scans of OpenVAS crashes the node.
I tried with version 10.5.13 and 10.5.16 -> crash.
Solution: Upgrade to 10.5.17 at least.

AcquireJobsRunnableImpl trows PSQLException: SSL error: readHandshakeRecord

With a successfully migrated Alfresco (5.2 to) 7.0 repository, I get PSQLException: SSL error: readHandshakeRecord (see stracktrace below) every morning at 4:00, which then causes the repository to stop responding.
Could someone please help me decipher this stack trace? Why is this job running around 4am? I can't find a suitable quartz job. Does anyone know how to manually force this call to fix this problem? At first I thought it might be related to the contentStoreCleaner running at 4:00 am, but disabling this job doesn't change anything.
The only work around I found so far was to disable the activity workflow engine.
2021-08-08 04:35:39,396 ERROR [org.activiti.engine.impl.jobexecutor.AcquireJobsRunnableImpl] [Thread-46] exception during job acquisition: Could not open JDBC Connection for transaction; nested exception is org.postgresql.util.PSQLException: SSL error: readHandshakeRecord
org.springframework.transaction.CannotCreateTransactionException: Could not open JDBC Connection for transaction; nested exception is org.postgresql.util.PSQLException: SSL error: readHandshakeRecord
at org.springframework.jdbc.datasource.DataSourceTransactionManager.doBegin(DataSourceTransactionManager.java:309)
at org.springframework.transaction.support.AbstractPlatformTransactionManager.startTransaction(AbstractPlatformTransactionManager.java:400)
at org.springframework.transaction.support.AbstractPlatformTransactionManager.getTransaction(AbstractPlatformTransactionManager.java:373)
at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:137)
at org.activiti.spring.SpringTransactionInterceptor.execute(SpringTransactionInterceptor.java:45)
at org.activiti.engine.impl.interceptor.LogInterceptor.execute(LogInterceptor.java:31)
at org.activiti.engine.impl.cfg.CommandExecutorImpl.execute(CommandExecutorImpl.java:40)
at org.activiti.engine.impl.cfg.CommandExecutorImpl.execute(CommandExecutorImpl.java:35)
at org.activiti.engine.impl.jobexecutor.AcquireJobsRunnableImpl.run(AcquireJobsRunnableImpl.java:54)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.postgresql.util.PSQLException: SSL error: readHandshakeRecord
at org.postgresql.ssl.MakeSSL.convert(MakeSSL.java:43)
at org.postgresql.core.v3.ConnectionFactoryImpl.enableSSL(ConnectionFactoryImpl.java:534)
at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:149)
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:213)
at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:51)
at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:223)
at org.postgresql.Driver.makeConnection(Driver.java:465)
at org.postgresql.Driver.connect(Driver.java:264)
at org.apache.commons.dbcp.DriverConnectionFactory.createConnection(DriverConnectionFactory.java:38)
at org.apache.commons.dbcp.PoolableConnectionFactory.makeObject(PoolableConnectionFactory.java:582)
at org.apache.commons.pool.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:1188)
at org.apache.commons.dbcp.PoolingDataSource.getConnection(PoolingDataSource.java:106)
at org.apache.commons.dbcp.BasicDataSource.getConnection(BasicDataSource.java:1044)
at org.springframework.jdbc.datasource.DataSourceTransactionManager.doBegin(DataSourceTransactionManager.java:265)
... 9 more
Caused by: javax.net.ssl.SSLException: readHandshakeRecord
at java.base/sun.security.ssl.SSLSocketImpl.readHandshakeRecord(SSLSocketImpl.java:1335)
at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:440)
at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:411)
at org.postgresql.ssl.MakeSSL.convert(MakeSSL.java:41)
... 22 more
Suppressed: java.net.SocketException: Broken pipe (Write failed)
at java.base/java.net.SocketOutputStream.socketWrite0(Native Method)
at java.base/java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:110)
at java.base/java.net.SocketOutputStream.write(SocketOutputStream.java:150)
at java.base/sun.security.ssl.SSLSocketOutputRecord.encodeAlert(SSLSocketOutputRecord.java:81)
at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:380)
at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:292)
at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:450)
... 24 more
Caused by: java.net.SocketException: Broken pipe (Write failed)
at java.base/java.net.SocketOutputStream.socketWrite0(Native Method)
at java.base/java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:110)
at java.base/java.net.SocketOutputStream.write(SocketOutputStream.java:150)
at java.base/sun.security.ssl.SSLSocketOutputRecord.flush(SSLSocketOutputRecord.java:251)
at java.base/sun.security.ssl.HandshakeOutStream.flush(HandshakeOutStream.java:89)
at java.base/sun.security.ssl.Finished$T13FinishedProducer.onProduceFinished(Finished.java:679)
at java.base/sun.security.ssl.Finished$T13FinishedProducer.produce(Finished.java:658)
at java.base/sun.security.ssl.SSLHandshake.produce(SSLHandshake.java:436)
at java.base/sun.security.ssl.Finished$T13FinishedConsumer.onConsumeFinished(Finished.java:1011)
at java.base/sun.security.ssl.Finished$T13FinishedConsumer.consume(Finished.java:874)
at java.base/sun.security.ssl.SSLHandshake.consume(SSLHandshake.java:392)
at java.base/sun.security.ssl.HandshakeContext.dispatch(HandshakeContext.java:443)
at java.base/sun.security.ssl.HandshakeContext.dispatch(HandshakeContext.java:421)
at java.base/sun.security.ssl.TransportContext.dispatch(TransportContext.java:182)
at java.base/sun.security.ssl.SSLTransport.decode(SSLTransport.java:171)
at java.base/sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1418)
at java.base/sun.security.ssl.SSLSocketImpl.readHandshakeRecord(SSLSocketImpl.java:1324)
... 25 more
The stacktrace was misleading - the jdbc connection problem was caused by memory filling up by the Alfresco trashcan cleaner module: OutOfMemoryError: Java heap space due to no limit on getChildAssocs.
We have ~ 4 million nodes in the trashcan and the module retrieves in every batch run all nodes again and again until the memory has been filled up ...

Data unpack would read past end of buffer in file util/show_help.c at line 501

I submitted a job via slurm. The job ran for 12 hours and was working as expected. Then I got Data unpack would read past end of buffer in file util/show_help.c at line 501. It is usual for me to get errors like ORTE has lost communication with a remote daemon but I usually get this in the beginning of the job. It is annoying but still does not cause as much time loss as getting error after 12 hours. Is there a quick fix for this? Open MPI version is 4.0.1.
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.
Local host: barbun40
Local adapter: mlx5_0
Local port: 1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: barbun40
Local device: mlx5_0
--------------------------------------------------------------------------
[barbun21.yonetim:48390] [[15284,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in
file util/show_help.c at line 501
[barbun21.yonetim:48390] 127 more processes have sent help message help-mpi-btl-openib.txt / ib port
not selected
[barbun21.yonetim:48390] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error
messages
[barbun21.yonetim:48390] 126 more processes have sent help message help-mpi-btl-openib.txt / error in
device init
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected. This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).
Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate. For
example, there may be a core file that you can examine. More
generally: such peer hangups are frequently caused by application bugs
or other external events.
Local host: barbun64
Local PID: 252415
Peer host: barbun39
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[15284,1],35]
Exit code: 9
--------------------------------------------------------------------------

Getting error while redeploying nodes

I am still using Corda 1.0 version. when i try to redeploy nodes with existing data, getting below error while start-up but able to access the nodes . If i clear the data and redeploy the nodes, i didn't face these error message.
Logs can be found in :
C:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kotlin-
source\build\nodes\xxxxxxxx\logs
Database connection url is : jdbc:h2:tcp://xxxxxxxxx/node
E 18:38:46+0530 [main] core.client.createConnection - AMQ214016: Failed to
create netty connection
javax.net.ssl.SSLException: handshake timed out
at io.netty.handler.ssl.SslHandler.handshake(...)(Unknown Source) ~[netty
all-4.1.9.Final.jar:4.1.9.Final]
Incoming connection address : xxxxxxxxxxxx
Listening on port : 10014
RPC service listening on port : 10015
Loaded CorDapps : corda-finance-1.0.0, kotlin-
source-0.1, corda-core-1.0.0
Node for "xxxxxxxxxxx" started up and registered in 213.08 sec
Welcome to the Corda interactive shell.
Useful commands include 'help' to see what is available, and 'bye' to shut
down the node.
Wed May 23 18:39:20 IST 2018>>> E 18:39:24+0530 [Thread-6 (ActiveMQ-server-
org.apache.activemq.artemis.core.server.impl.ActiveMQServerImp
l$3#4a532271)] core.client.createConnection - AMQ214016: Failed to create
netty connection
javax.net.ssl.SSLException: handshake timed out
at io.netty.handler.ssl.SslHandler.handshake(...)(Unknown Source) ~[netty-
all-4.1.9.Final.jar:4.1.9.Final]
This looks like the Artemis failed to connect to the node which means the node fails to start.
You should look at the log and if there are any other previous Corda node started which occupy the node.
If there are any legacy Corda nodes that have not been killed, try ps -ef |grep java to see if there is any other java still alive. Especially look for the port number and check if they are overlapped

Every 16th call of swiftmailer:spool:send ends up with STMP timeout

I use file spool in my Symfony2.2 project. I flush spool every minut with cron.
An error occurs every 16 minutes with an accuracy of a few seconds: e.g. 14:38:31, 14:54:31, 15:14:31, 15:30:31. etc.
[12-Aug-2013 16:02:31 Europe/Berlin] PHP Fatal error: Uncaught exception 'Swift_IoException' with message 'Connection to mail.mydomain.pl:25 Timed Out' in /home/projectname/test.projectname.pl/vendor/swiftmailer/swiftmailer/lib/classes/Swift/Transport/StreamBuffer.php:169
Here's stack trace:
Stack trace:
#0 /home/projectname/test.projectname.pl/vendor/swiftmailer/swiftmailer/lib/classes/Swift/Transport/AbstractSmtpTransport.php(400): Swift_Transport_StreamBuffer->readLine(NULL)
#1 /home/projectname/test.projectname.pl/vendor/swiftmailer/swiftmailer/lib/classes/Swift/Transport/AbstractSmtpTransport.php(277): Swift_Transport_AbstractSmtpTransport->_getFullResponse(NULL)
#2 /home/projectname/test.projectname.pl/vendor/swiftmailer/swiftmailer/lib/classes/Swift/Transport/EsmtpTransport.php(245): Swift_Transport_AbstractSmtpTransport->executeCommand('QUIT??', Array, Array)
#3 /home/projectname/test.projectname.pl/vendor/swiftmailer/swiftmailer/lib/classes/Swift/Transport/AbstractSmtpTransport.php(217): Swift_Transport_EsmtpTransport->executeCommand('QUIT??', Array)
#4 /home/projectname/test.projectname.pl/vendor/swiftmailer/swiftmailer/lib/classes/ in /home/projectname/test.projectname.pl/vendor/swiftmailer/swiftmailer/lib/classes/Swift/Transport/StreamBuffer.php on line 169
I'm out of ideas and do not know what I can check next.
Maybe there is a connection limit to the server, but do not know how to check it out. Besides - should not swiftmailer:spool:send close the connection after execution?
This is the command from cron:
/home/projectname/test.projectname.pl/app/console swiftmailer:spool:send
UPDATE: It was due to smtp server provider.

Resources