With a successfully migrated Alfresco (5.2 to) 7.0 repository, I get PSQLException: SSL error: readHandshakeRecord (see stracktrace below) every morning at 4:00, which then causes the repository to stop responding.
Could someone please help me decipher this stack trace? Why is this job running around 4am? I can't find a suitable quartz job. Does anyone know how to manually force this call to fix this problem? At first I thought it might be related to the contentStoreCleaner running at 4:00 am, but disabling this job doesn't change anything.
The only work around I found so far was to disable the activity workflow engine.
2021-08-08 04:35:39,396 ERROR [org.activiti.engine.impl.jobexecutor.AcquireJobsRunnableImpl] [Thread-46] exception during job acquisition: Could not open JDBC Connection for transaction; nested exception is org.postgresql.util.PSQLException: SSL error: readHandshakeRecord
org.springframework.transaction.CannotCreateTransactionException: Could not open JDBC Connection for transaction; nested exception is org.postgresql.util.PSQLException: SSL error: readHandshakeRecord
at org.springframework.jdbc.datasource.DataSourceTransactionManager.doBegin(DataSourceTransactionManager.java:309)
at org.springframework.transaction.support.AbstractPlatformTransactionManager.startTransaction(AbstractPlatformTransactionManager.java:400)
at org.springframework.transaction.support.AbstractPlatformTransactionManager.getTransaction(AbstractPlatformTransactionManager.java:373)
at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:137)
at org.activiti.spring.SpringTransactionInterceptor.execute(SpringTransactionInterceptor.java:45)
at org.activiti.engine.impl.interceptor.LogInterceptor.execute(LogInterceptor.java:31)
at org.activiti.engine.impl.cfg.CommandExecutorImpl.execute(CommandExecutorImpl.java:40)
at org.activiti.engine.impl.cfg.CommandExecutorImpl.execute(CommandExecutorImpl.java:35)
at org.activiti.engine.impl.jobexecutor.AcquireJobsRunnableImpl.run(AcquireJobsRunnableImpl.java:54)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.postgresql.util.PSQLException: SSL error: readHandshakeRecord
at org.postgresql.ssl.MakeSSL.convert(MakeSSL.java:43)
at org.postgresql.core.v3.ConnectionFactoryImpl.enableSSL(ConnectionFactoryImpl.java:534)
at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:149)
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:213)
at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:51)
at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:223)
at org.postgresql.Driver.makeConnection(Driver.java:465)
at org.postgresql.Driver.connect(Driver.java:264)
at org.apache.commons.dbcp.DriverConnectionFactory.createConnection(DriverConnectionFactory.java:38)
at org.apache.commons.dbcp.PoolableConnectionFactory.makeObject(PoolableConnectionFactory.java:582)
at org.apache.commons.pool.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:1188)
at org.apache.commons.dbcp.PoolingDataSource.getConnection(PoolingDataSource.java:106)
at org.apache.commons.dbcp.BasicDataSource.getConnection(BasicDataSource.java:1044)
at org.springframework.jdbc.datasource.DataSourceTransactionManager.doBegin(DataSourceTransactionManager.java:265)
... 9 more
Caused by: javax.net.ssl.SSLException: readHandshakeRecord
at java.base/sun.security.ssl.SSLSocketImpl.readHandshakeRecord(SSLSocketImpl.java:1335)
at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:440)
at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:411)
at org.postgresql.ssl.MakeSSL.convert(MakeSSL.java:41)
... 22 more
Suppressed: java.net.SocketException: Broken pipe (Write failed)
at java.base/java.net.SocketOutputStream.socketWrite0(Native Method)
at java.base/java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:110)
at java.base/java.net.SocketOutputStream.write(SocketOutputStream.java:150)
at java.base/sun.security.ssl.SSLSocketOutputRecord.encodeAlert(SSLSocketOutputRecord.java:81)
at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:380)
at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:292)
at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:450)
... 24 more
Caused by: java.net.SocketException: Broken pipe (Write failed)
at java.base/java.net.SocketOutputStream.socketWrite0(Native Method)
at java.base/java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:110)
at java.base/java.net.SocketOutputStream.write(SocketOutputStream.java:150)
at java.base/sun.security.ssl.SSLSocketOutputRecord.flush(SSLSocketOutputRecord.java:251)
at java.base/sun.security.ssl.HandshakeOutStream.flush(HandshakeOutStream.java:89)
at java.base/sun.security.ssl.Finished$T13FinishedProducer.onProduceFinished(Finished.java:679)
at java.base/sun.security.ssl.Finished$T13FinishedProducer.produce(Finished.java:658)
at java.base/sun.security.ssl.SSLHandshake.produce(SSLHandshake.java:436)
at java.base/sun.security.ssl.Finished$T13FinishedConsumer.onConsumeFinished(Finished.java:1011)
at java.base/sun.security.ssl.Finished$T13FinishedConsumer.consume(Finished.java:874)
at java.base/sun.security.ssl.SSLHandshake.consume(SSLHandshake.java:392)
at java.base/sun.security.ssl.HandshakeContext.dispatch(HandshakeContext.java:443)
at java.base/sun.security.ssl.HandshakeContext.dispatch(HandshakeContext.java:421)
at java.base/sun.security.ssl.TransportContext.dispatch(TransportContext.java:182)
at java.base/sun.security.ssl.SSLTransport.decode(SSLTransport.java:171)
at java.base/sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1418)
at java.base/sun.security.ssl.SSLSocketImpl.readHandshakeRecord(SSLSocketImpl.java:1324)
... 25 more
The stacktrace was misleading - the jdbc connection problem was caused by memory filling up by the Alfresco trashcan cleaner module: OutOfMemoryError: Java heap space due to no limit on getChildAssocs.
We have ~ 4 million nodes in the trashcan and the module retrieves in every batch run all nodes again and again until the memory has been filled up ...
Related
I have a cluster with 2 galera nodes and 1 arbitrator.
My node 1 crashed I don't understand why..
Here is the log of the node 1.
It seems that it is a problem with the pthread library.
Also every requests are proxied by 2 HAProxy.
2023-01-03 12:08:55 0 [Warning] WSREP: Handshake failed: peer did not return a certificate
2023-01-03 12:08:55 0 [Warning] WSREP: Handshake failed: peer did not return a certificate
2023-01-03 12:08:56 0 [Warning] WSREP: Handshake failed: http request
terminate called after throwing an instance of 'boost::wrapexcept<std::system_error>'
what(): remote_endpoint: Transport endpoint is not connected
230103 12:08:56 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Server version: 10.5.13-MariaDB-1:10.5.13+maria~focal
key_buffer_size=134217728
read_buffer_size=2097152
max_used_connections=101
max_threads=102
thread_count=106
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 760333 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x49000
mariadbd(my_print_stacktrace+0x32)[0x55b1b67f7e42]
Printing to addr2line failed
mariadbd(handle_fatal_signal+0x485)[0x55b1b62479a5]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7ff88ea983c0]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7ff88e59e18b]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7ff88e57d859]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911)[0x7ff88e939911]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c)[0x7ff88e94538c]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3f7)[0x7ff88e9453f7]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa6a9)[0x7ff88e9456a9]
/usr/lib/galera/libgalera_smm.so(+0x448ad)[0x7ff884b5e8ad]
/usr/lib/galera/libgalera_smm.so(+0x1fc315)[0x7ff884d16315]
/usr/lib/galera/libgalera_smm.so(+0x1ff7eb)[0x7ff884d197eb]
/usr/lib/galera/libgalera_smm.so(+0x1ffc28)[0x7ff884d19c28]
/usr/lib/galera/libgalera_smm.so(+0x2065b6)[0x7ff884d205b6]
/usr/lib/galera/libgalera_smm.so(+0x1f81f3)[0x7ff884d121f3]
/usr/lib/galera/libgalera_smm.so(+0x1e6f04)[0x7ff884d00f04]
/usr/lib/galera/libgalera_smm.so(+0x103438)[0x7ff884c1d438]
/usr/lib/galera/libgalera_smm.so(+0xe8eea)[0x7ff884c02eea]
/usr/lib/galera/libgalera_smm.so(+0xe9a8d)[0x7ff884c03a8d]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x9609)[0x7ff88ea8c609]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7ff88e67a293]
The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /var/lib/mysql
Resource Limits:
Fatal signal 11 while backtracing
PS: if you want more data ask me :)
OK it seems that 2 simultaneous scans of OpenVAS crashes the node.
I tried with version 10.5.13 and 10.5.16 -> crash.
Solution: Upgrade to 10.5.17 at least.
I have deploy API manager 4.0.0 All-in-one on 2 VMs, front the system with a load balancer.
When one node shutdown by command "sh api-manager.sh stop", another swithes success and runs well , but there are some error in console like below:
TID: [-1] [] [2022-03-14 10:14:13,270] ERROR {org.wso2.carbon.databridge.agent.endpoint.DataEndpointConnectionWorker} - Error while trying to connect
to the endpoint. Cannot borrow client for ssl://10.32.73.10:9711 org.wso2.carbon.databridge.agent.exception.DataEndpointAuthenticationException: Cannot borrow client for ssl://10.32.73.10:9711 at org.wso2.carbon.databridge.agent.endpoint.DataEndpointConnectionWorker.connect(DataEndpointConnectionWorker.java:147)
at org.wso2.carbon.databridge.agent.endpoint.DataEndpointConnectionWorker.run(DataEndpointConnectionWorker.java:59)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.wso2.carbon.databridge.agent.exception.DataEndpointException: Error while opening socket to 10.32.73.10:9711. Connection refused (Conne
ction refused) at org.wso2.carbon.databridge.agent.endpoint.binary.BinarySecureClientPoolFactory.createClient(BinarySecureClientPoolFactory.java:75)
at org.wso2.carbon.databridge.agent.client.AbstractClientPoolFactory.makeObject(AbstractClientPoolFactory.java:39)
at org.apache.commons.pool.impl.GenericKeyedObjectPool.borrowObject(GenericKeyedObjectPool.java:1212)
at org.wso2.carbon.databridge.agent.endpoint.DataEndpointConnectionWorker.connect(DataEndpointConnectionWorker.java:137)
... 6 more
Caused by: java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:476)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:218)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:200)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:394)
at java.net.Socket.connect(Socket.java:606)
at sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:287)
at sun.security.ssl.SSLSocketImpl.<init>(SSLSocketImpl.java:146)
at sun.security.ssl.SSLSocketFactoryImpl.createSocket(SSLSocketFactoryImpl.java:88)
at org.wso2.carbon.databridge.agent.endpoint.binary.BinarySecureClientPoolFactory.createClient(BinarySecureClientPoolFactory.java:58)
... 9 more
TID: [-1] [] [2022-03-14 10:14:15,158] WARN {org.wso2.carbon.databridge.agent.endpoint.DataEndpointGroup} - No receiver is reachable at URL Endpoint/
Endpoints [tcp://10.32.73.10:9611], will try to reconnect every 30 sec
Are there anything wrong in the deployment.toml?
In APIM active-active setup, each node is publishing throttling data to itself and to the other node. When you stop the other node, it can't publish the throttling data to the other node. Hence you see connection refused errors and this is expected. No harm having these error logs. It will recover when the other node is started. If you look at the deployment.toml, can find the other node details under the throttling configurations.
I am still using Corda 1.0 version. when i try to redeploy nodes with existing data, getting below error while start-up but able to access the nodes . If i clear the data and redeploy the nodes, i didn't face these error message.
Logs can be found in :
C:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kotlin-
source\build\nodes\xxxxxxxx\logs
Database connection url is : jdbc:h2:tcp://xxxxxxxxx/node
E 18:38:46+0530 [main] core.client.createConnection - AMQ214016: Failed to
create netty connection
javax.net.ssl.SSLException: handshake timed out
at io.netty.handler.ssl.SslHandler.handshake(...)(Unknown Source) ~[netty
all-4.1.9.Final.jar:4.1.9.Final]
Incoming connection address : xxxxxxxxxxxx
Listening on port : 10014
RPC service listening on port : 10015
Loaded CorDapps : corda-finance-1.0.0, kotlin-
source-0.1, corda-core-1.0.0
Node for "xxxxxxxxxxx" started up and registered in 213.08 sec
Welcome to the Corda interactive shell.
Useful commands include 'help' to see what is available, and 'bye' to shut
down the node.
Wed May 23 18:39:20 IST 2018>>> E 18:39:24+0530 [Thread-6 (ActiveMQ-server-
org.apache.activemq.artemis.core.server.impl.ActiveMQServerImp
l$3#4a532271)] core.client.createConnection - AMQ214016: Failed to create
netty connection
javax.net.ssl.SSLException: handshake timed out
at io.netty.handler.ssl.SslHandler.handshake(...)(Unknown Source) ~[netty-
all-4.1.9.Final.jar:4.1.9.Final]
This looks like the Artemis failed to connect to the node which means the node fails to start.
You should look at the log and if there are any other previous Corda node started which occupy the node.
If there are any legacy Corda nodes that have not been killed, try ps -ef |grep java to see if there is any other java still alive. Especially look for the port number and check if they are overlapped
I am attempting to bootstrap a private OpenStack cloud using Cloudify 2.7.1. It boots up the Linux instance correctly but fails "Uploading files to 192.168.10.XXX." due to an SFTP problem : "Could not determine the type of file "sftp://root:***#192.168.10.xxx/root/gs-files".".
I can access to the Instance using ssh (there is no probleme in the connection). I tried with other images (CentOS, Ubuntu, Cerros, ...) but always the same error !!
can anyone help me please ?
I attached a screenshot of the network topology created by Cloudify, and the stack trace.
Full stack trace:
2015-04-30 10:26:27,470 INFO [org.cloudifysource.shell.commands.AbstractGSCommand] - Setting security profile to "nonsecure".
2015-04-30
10:26:27,589 INFO
[org.cloudifysource.shell.commands.AbstractGSCommand] - Bootstrapping
cloud openstack-havana. This may take a few minutes.
2015-04-30
10:26:27,677 INFO
[org.cloudifysource.esc.driver.provisioning.BaseProvisioningDriver] -
Setup network configuration for managers
2015-04-30 10:26:27,677
INFO [org.cloudifysource.esc.driver.provisioning.BaseProvisioningDriver]
- Using management network : Cloudify-Management-Network
2015-04-30
10:26:51,536 INFO
[org.cloudifysource.esc.shell.listener.CliAgentlessInstallerListener] -
Attempting to access Management VM 192.168.10.241.
2015-04-30
10:27:10,551 INFO
[org.cloudifysource.esc.shell.listener.CliAgentlessInstallerListener] -
Uploading files to 192.168.10.241.
2015-04-30 10:27:15,708 WARNING [com.jcraft.jsch] - Permanently added '192.168.10.241' (RSA) to the list of known hosts.
2015-04-30
10:27:25,998 INFO
[org.cloudifysource.esc.shell.installer.CloudGridAgentBootstrapper] -
Failed accessing management VM 192.168.10.241 Reason: Failed to set up
file transfer: Unknown message with code "Could not determine the type
of file "sftp://cirros#192.168.10.241/cirros/gs-files".".; Caused by:
org.cloudifysource.esc.installer.InstallerException: Failed to set up
file transfer: Unknown message with code "Could not determine the type
of file "sftp://cirros#192.168.10.241/cirros/gs-files".".
2015-04-30
10:27:26,210 INFO
[org.cloudifysource.esc.driver.provisioning.openstack.OpenStackCloudifyDriver]
- Deleting Floating ip:
FloatingIp[floatingNetworkId=15578898-5e6b-44d9-a73a-1328ca6ea140,floatingIpAddress=192.168.10.241,portId=4b8dc211-12e8-4383-8799-f783d2786e98,id=593d8424-cfec-41ed-8204-ed8609366416]
2015-04-30
10:27:29,607 SEVERE
[org.cloudifysource.shell.commands.AbstractGSCommand] - Failed to set up
file transfer: Unknown message with code "Could not determine the type
of file "sftp://cirros#192.168.10.241/cirros/gs-files".". :
org.cloudifysource.esc.installer.InstallerException: Failed to set up
file transfer: Unknown message with code "Could not determine the type
of file "sftp://cirros#192.168.10.241/cirros/gs-files".".
at org.cloudifysource.esc.installer.filetransfer.VfsFileTransfer.initialize(VfsFileTransfer.java:206)
at org.cloudifysource.esc.installer.AgentlessInstaller.uploadFilesToServer(AgentlessInstaller.java:306)
at org.cloudifysource.esc.installer.AgentlessInstaller.installOnMachineWithIP(AgentlessInstaller.java:210)
at org.cloudifysource.esc.shell.installer.CloudGridAgentBootstrapper$1.call(CloudGridAgentBootstrapper.java:865)
at org.cloudifysource.esc.shell.installer.CloudGridAgentBootstrapper$1.call(CloudGridAgentBootstrapper.java:860)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused
by: org.apache.commons.vfs2.FileSystemException: Unknown message with
code "Could not determine the type of file
"sftp://cirros#192.168.10.241/cirros/gs-files".".
at org.apache.commons.vfs2.provider.sftp.SftpFileObject.refresh(SftpFileObject.java:95)
at org.apache.commons.vfs2.provider.AbstractFileSystem.resolveFile(AbstractFileSystem.java:366)
at org.apache.commons.vfs2.provider.AbstractFileSystem.resolveFile(AbstractFileSystem.java:317)
at org.apache.commons.vfs2.provider.AbstractOriginatingFileProvider.findFile(AbstractOriginatingFileProvider.java:85)
at org.apache.commons.vfs2.provider.AbstractOriginatingFileProvider.findFile(AbstractOriginatingFileProvider.java:65)
at org.apache.commons.vfs2.impl.DefaultFileSystemManager.resolveFile(DefaultFileSystemManager.java:693)
at org.apache.commons.vfs2.impl.DefaultFileSystemManager.resolveFile(DefaultFileSystemManager.java:621)
at org.cloudifysource.esc.installer.filetransfer.VfsFileTransfer.resolveTargetDirectory(VfsFileTransfer.java:218)
at org.cloudifysource.esc.installer.filetransfer.VfsFileTransfer.initialize(VfsFileTransfer.java:203)
... 8 more
Caused
by: org.apache.commons.vfs2.FileSystemException: Could not determine
the type of file "sftp://cirros#192.168.10.241/cirros/gs-files".
at org.apache.commons.vfs2.provider.AbstractFileObject.getType(AbstractFileObject.java:505)
at org.apache.commons.vfs2.provider.sftp.SftpFileObject.refresh(SftpFileObject.java:91)
... 16 more
Caused by: org.apache.commons.vfs2.FileSystemException: Could not connect to SFTP server at "sftp://cirros#192.168.10.241/".
at org.apache.commons.vfs2.provider.sftp.SftpFileSystem.getChannel(SftpFileSystem.java:153)
at org.apache.commons.vfs2.provider.sftp.SftpFileObject.statSelf(SftpFileObject.java:151)
at org.apache.commons.vfs2.provider.sftp.SftpFileObject.doGetType(SftpFileObject.java:114)
at org.apache.commons.vfs2.provider.AbstractFileObject.getType(AbstractFileObject.java:496)
... 17 more
Caused by: com.jcraft.jsch.JSchException: java.io.IOException: Pipe closed
at com.jcraft.jsch.ChannelSftp.start(ChannelSftp.java:288)
at com.jcraft.jsch.Channel.connect(Channel.java:152)
at com.jcraft.jsch.Channel.connect(Channel.java:145)
at org.apache.commons.vfs2.provider.sftp.SftpFileSystem.getChannel(SftpFileSystem.java:130)
... 20 more
Caused by: java.io.IOException: Pipe closed
at java.io.PipedInputStream.read(PipedInputStream.java:308)
at java.io.PipedInputStream.read(PipedInputStream.java:378)
at com.jcraft.jsch.ChannelSftp.fill(ChannelSftp.java:2665)
at com.jcraft.jsch.ChannelSftp.header(ChannelSftp.java:2691)
at com.jcraft.jsch.ChannelSftp.start(ChannelSftp.java:257)
... 23 more
Looks like you are trying to sftp into a cirros instance - I am not sure cirros even supports sftp. You can try this by using the sftp command line utility.
In general, sftp has to be configured and available on the target machine.
You can try using the SCP file transfer mode by setting this in your compute template:
fileTransfer org.cloudifysource.domain.cloud.FileTransferModes.SCP
If you are really using cirros, I suspect bootstrapping will fail. Cloudify was never tested on cirros. I think cirros is lacking some very basic utilities (I think it is not running bash. Not sure if it has wget). Cirros was never meant as a generic distribution - it is meant for testing your cloud's basic functionality.
One more thing - Cloudify 2 has reached End-of-Life - it is no longer supported. You should check out Cloudify 3.
I'm trying to get Spark, and SparkR, running on a small EC2 cluster using the provided scripts and directions. Whenever I ask for an operation that would require computation on an RDD (e.g., collect(), reduce()), I get the error logged below. Workers do appear to startup correctly -- if I only parallelize, I see the workers running via the master's web ui.
The error I get is similar to the one in Intermittent Timeout Exception using Spark and I've been through all of the solutions there (modifying the conf file for URL's, disabling the firewall, etc.), no luck.
Here is the error log, thank you in advance for your help:
15/02/17 19:10:22 INFO executor.CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
15/02/17 19:10:22 INFO spark.SecurityManager: Changing view acls to: root,-
15/02/17 19:10:22 INFO spark.SecurityManager: Changing modify acls to: root,-
15/02/17 19:10:22 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root, -); users with modify permissions: Set(root, -)
15/02/17 19:10:23 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/02/17 19:10:23 INFO Remoting: Starting remoting
15/02/17 19:10:23 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverPropsFetcher#-.ec2.internal:60218]
15/02/17 19:10:23 INFO util.Utils: Successfully started service 'driverPropsFetcher' on port 60218.
15/02/17 19:10:53 ERROR security.UserGroupInformation: PriviledgedActionException as:- cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1134)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:115)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:161)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.security.PrivilegedActionException: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
... 4 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:127)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:59)
... 7 more
This was ultimately resolved by a combination of
- Updates to SparkR, which have resolved a number of serialization issues.
- Recognizing that the Spark-ec2 scripts require that the control node and master node be the same machine.
and
- Replacing calls to parallelize() with distributing and then loading the data by hadoop.
I am writing an intro to SparkR for R programmers that I hope will help people with things like this in the future.