Spark & SparkR Configuration on EC2 - Java Timeout - r

I'm trying to get Spark, and SparkR, running on a small EC2 cluster using the provided scripts and directions. Whenever I ask for an operation that would require computation on an RDD (e.g., collect(), reduce()), I get the error logged below. Workers do appear to startup correctly -- if I only parallelize, I see the workers running via the master's web ui.
The error I get is similar to the one in Intermittent Timeout Exception using Spark and I've been through all of the solutions there (modifying the conf file for URL's, disabling the firewall, etc.), no luck.
Here is the error log, thank you in advance for your help:
15/02/17 19:10:22 INFO executor.CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
15/02/17 19:10:22 INFO spark.SecurityManager: Changing view acls to: root,-
15/02/17 19:10:22 INFO spark.SecurityManager: Changing modify acls to: root,-
15/02/17 19:10:22 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root, -); users with modify permissions: Set(root, -)
15/02/17 19:10:23 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/02/17 19:10:23 INFO Remoting: Starting remoting
15/02/17 19:10:23 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverPropsFetcher#-.ec2.internal:60218]
15/02/17 19:10:23 INFO util.Utils: Successfully started service 'driverPropsFetcher' on port 60218.
15/02/17 19:10:53 ERROR security.UserGroupInformation: PriviledgedActionException as:- cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1134)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:115)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:161)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.security.PrivilegedActionException: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
... 4 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:127)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:59)
... 7 more

This was ultimately resolved by a combination of
- Updates to SparkR, which have resolved a number of serialization issues.
- Recognizing that the Spark-ec2 scripts require that the control node and master node be the same machine.
and
- Replacing calls to parallelize() with distributing and then loading the data by hadoop.
I am writing an intro to SparkR for R programmers that I hope will help people with things like this in the future.

Related

Failed to bind handlers: DNS Error: Not Implemented; retrying in 2 s

When trying to start Yagna I receive this error, what can I do? I can probably get some DEBUG logs if needed?
[2021-05-06T08:45:08Z INFO yagna] Starting yagna service! Version: 0.6.4 (4fc72117 2021-04-15 build #135).
Log is written to /home/user/.local/share/yagna/yagna_rCURRENT.log
[2021-05-06T08:45:08Z INFO yagna] Data directory: /home/user/.local/share/yagna
[2021-05-06T08:45:08Z INFO ya_sb_router::unix] Router listening on: "/tmp/yagna.sock"
[2021-05-06T08:45:08Z INFO ya_persistence::executor] using database at: /home/user/.local/share/yagna/yagna.db
[2021-05-06T08:45:08Z INFO ya_persistence::executor] using database at: /home/user/.local/share/yagna/market.db
[2021-05-06T08:45:08Z INFO ya_persistence::executor] using database at: /home/user/.local/share/yagna/activity.db
[2021-05-06T08:45:08Z INFO ya_persistence::executor] using database at: /home/user/.local/share/yagna/payment.db
[2021-05-06T08:45:08Z INFO ya_identity::service::identity] using default identity: 0xf5ecffecf053508fe97255e046a04ce21c8ee525
[2021-05-06T08:45:08Z INFO yagna] Identity GSB service successfully activated
[2021-05-06T08:45:08Z INFO ya_metrics::pusher] Metrics pusher started
[2021-05-06T08:45:08Z INFO yagna] Metrics GSB service successfully activated
[2021-05-06T08:45:08Z INFO ya_service_bus::remote_router] trying to connect to: /tmp/yagna.sock
[2021-05-06T08:45:08Z INFO ya_service_bus::connection] started connection to gsb
[2021-05-06T08:45:08Z INFO ya_metrics::pusher] Starting metrics pusher
[2021-05-06T08:45:10Z INFO yagna] Version GSB service successfully activated
[2021-05-06T08:45:10Z INFO ya_net::service] using default identity as network id: 0xf5ecffecf053508fe97255e046a04ce21c8ee525
[2021-05-06T08:45:10Z WARN ya_net::handler] Failed to bind handlers: DNS Error: Not Implemented; retrying in 2 s
[2021-05-06T08:45:12Z WARN ya_net::handler] Failed to bind handlers: DNS Error: Not Implemented; retrying in 4 s
[2021-05-06T08:45:16Z WARN ya_net::handler] Failed to bind handlers: DNS Error: Not Implemented; retrying in 8 s
EDIT: nslookup
Server: 10.139.1.1
Address: 10.139.1.1#53
** server can't find _net._tcp.dev.golem.network: NOTIMP
I'm not sure what is the reason here, but it seems like DNS is not able to resolve _net._tcp.dev.golem.network SRV record yielding 'Not Implemented'. It is very odd, since Yagna is using Google's DNS servers as a default.
When you face this again pls try to check output of following command
nslookup -q=SRV _net._tcp.dev.golem.network 8.8.8.8
The user has trouble reaching Google's DNS with nslookup, so it appears to be something on his end. He is also using a proxy for his connection, so it must happen somewhere in there. Closing thread.

visual studio code-server not workin with compute engine

I host a web site with Nginx in Debian in google cloud.
when I install code server and i visit: my_ip:8443, chrome replied with: ERR_CONNECTION_TIMED_OUT.
when I execute the command code-server:
INFO code-server v1.1156-vsc1.33.1
INFO Additional documentation: http://github.com/cdr/code-server
INFO Initializing {"data-dir":"/home/naji/.local/share/code-server","extensions-dir":"/home/naji/.local/share/code-server/extensions","working-dir":"/","log-dir":"/home/naji/.cache/code-server/logs/20190723145420188"}
INFO Starting webserver... {"host":"0.0.0.0","port":8443}
WARN No certificate specified. This could be insecure.
WARN Documentation on securing your setup: https://github.com/cdr/code-server/blob/master/doc/security/ssl.md
INFO
INFO Password: ******************
INFO
INFO Started (click the link below to open):
INFO https://localhost:8443/
INFO
INFO Starting shared process [1/5]...
WARN stderr {"data":"(node:17025) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead.\n"}
INFO Connected to shared process
what is the solution ?

Getting error while redeploying nodes

I am still using Corda 1.0 version. when i try to redeploy nodes with existing data, getting below error while start-up but able to access the nodes . If i clear the data and redeploy the nodes, i didn't face these error message.
Logs can be found in :
C:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kotlin-
source\build\nodes\xxxxxxxx\logs
Database connection url is : jdbc:h2:tcp://xxxxxxxxx/node
E 18:38:46+0530 [main] core.client.createConnection - AMQ214016: Failed to
create netty connection
javax.net.ssl.SSLException: handshake timed out
at io.netty.handler.ssl.SslHandler.handshake(...)(Unknown Source) ~[netty
all-4.1.9.Final.jar:4.1.9.Final]
Incoming connection address : xxxxxxxxxxxx
Listening on port : 10014
RPC service listening on port : 10015
Loaded CorDapps : corda-finance-1.0.0, kotlin-
source-0.1, corda-core-1.0.0
Node for "xxxxxxxxxxx" started up and registered in 213.08 sec
Welcome to the Corda interactive shell.
Useful commands include 'help' to see what is available, and 'bye' to shut
down the node.
Wed May 23 18:39:20 IST 2018>>> E 18:39:24+0530 [Thread-6 (ActiveMQ-server-
org.apache.activemq.artemis.core.server.impl.ActiveMQServerImp
l$3#4a532271)] core.client.createConnection - AMQ214016: Failed to create
netty connection
javax.net.ssl.SSLException: handshake timed out
at io.netty.handler.ssl.SslHandler.handshake(...)(Unknown Source) ~[netty-
all-4.1.9.Final.jar:4.1.9.Final]
This looks like the Artemis failed to connect to the node which means the node fails to start.
You should look at the log and if there are any other previous Corda node started which occupy the node.
If there are any legacy Corda nodes that have not been killed, try ps -ef |grep java to see if there is any other java still alive. Especially look for the port number and check if they are overlapped

Cloudify : "Could not determine the type of file "sftp://root:***#192.168.10.xxx/root/gs-files"."

I am attempting to bootstrap a private OpenStack cloud using Cloudify 2.7.1. It boots up the Linux instance correctly but fails "Uploading files to 192.168.10.XXX." due to an SFTP problem : "Could not determine the type of file "sftp://root:***#192.168.10.xxx/root/gs-files".".
I can access to the Instance using ssh (there is no probleme in the connection). I tried with other images (CentOS, Ubuntu, Cerros, ...) but always the same error !!
can anyone help me please ?
I attached a screenshot of the network topology created by Cloudify, and the stack trace.
Full stack trace:
2015-04-30 10:26:27,470 INFO [org.cloudifysource.shell.commands.AbstractGSCommand] - Setting security profile to "nonsecure".
2015-04-30
10:26:27,589 INFO
[org.cloudifysource.shell.commands.AbstractGSCommand] - Bootstrapping
cloud openstack-havana. This may take a few minutes.
2015-04-30
10:26:27,677 INFO
[org.cloudifysource.esc.driver.provisioning.BaseProvisioningDriver] -
Setup network configuration for managers
2015-04-30 10:26:27,677
INFO [org.cloudifysource.esc.driver.provisioning.BaseProvisioningDriver]
- Using management network : Cloudify-Management-Network
2015-04-30
10:26:51,536 INFO
[org.cloudifysource.esc.shell.listener.CliAgentlessInstallerListener] -
Attempting to access Management VM 192.168.10.241.
2015-04-30
10:27:10,551 INFO
[org.cloudifysource.esc.shell.listener.CliAgentlessInstallerListener] -
Uploading files to 192.168.10.241.
2015-04-30 10:27:15,708 WARNING [com.jcraft.jsch] - Permanently added '192.168.10.241' (RSA) to the list of known hosts.
2015-04-30
10:27:25,998 INFO
[org.cloudifysource.esc.shell.installer.CloudGridAgentBootstrapper] -
Failed accessing management VM 192.168.10.241 Reason: Failed to set up
file transfer: Unknown message with code "Could not determine the type
of file "sftp://cirros#192.168.10.241/cirros/gs-files".".; Caused by:
org.cloudifysource.esc.installer.InstallerException: Failed to set up
file transfer: Unknown message with code "Could not determine the type
of file "sftp://cirros#192.168.10.241/cirros/gs-files".".
2015-04-30
10:27:26,210 INFO
[org.cloudifysource.esc.driver.provisioning.openstack.OpenStackCloudifyDriver]
- Deleting Floating ip:
FloatingIp[floatingNetworkId=15578898-5e6b-44d9-a73a-1328ca6ea140,floatingIpAddress=192.168.10.241,portId=4b8dc211-12e8-4383-8799-f783d2786e98,id=593d8424-cfec-41ed-8204-ed8609366416]
2015-04-30
10:27:29,607 SEVERE
[org.cloudifysource.shell.commands.AbstractGSCommand] - Failed to set up
file transfer: Unknown message with code "Could not determine the type
of file "sftp://cirros#192.168.10.241/cirros/gs-files".". :
org.cloudifysource.esc.installer.InstallerException: Failed to set up
file transfer: Unknown message with code "Could not determine the type
of file "sftp://cirros#192.168.10.241/cirros/gs-files".".
at org.cloudifysource.esc.installer.filetransfer.VfsFileTransfer.initialize(VfsFileTransfer.java:206)
at org.cloudifysource.esc.installer.AgentlessInstaller.uploadFilesToServer(AgentlessInstaller.java:306)
at org.cloudifysource.esc.installer.AgentlessInstaller.installOnMachineWithIP(AgentlessInstaller.java:210)
at org.cloudifysource.esc.shell.installer.CloudGridAgentBootstrapper$1.call(CloudGridAgentBootstrapper.java:865)
at org.cloudifysource.esc.shell.installer.CloudGridAgentBootstrapper$1.call(CloudGridAgentBootstrapper.java:860)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused
by: org.apache.commons.vfs2.FileSystemException: Unknown message with
code "Could not determine the type of file
"sftp://cirros#192.168.10.241/cirros/gs-files".".
at org.apache.commons.vfs2.provider.sftp.SftpFileObject.refresh(SftpFileObject.java:95)
at org.apache.commons.vfs2.provider.AbstractFileSystem.resolveFile(AbstractFileSystem.java:366)
at org.apache.commons.vfs2.provider.AbstractFileSystem.resolveFile(AbstractFileSystem.java:317)
at org.apache.commons.vfs2.provider.AbstractOriginatingFileProvider.findFile(AbstractOriginatingFileProvider.java:85)
at org.apache.commons.vfs2.provider.AbstractOriginatingFileProvider.findFile(AbstractOriginatingFileProvider.java:65)
at org.apache.commons.vfs2.impl.DefaultFileSystemManager.resolveFile(DefaultFileSystemManager.java:693)
at org.apache.commons.vfs2.impl.DefaultFileSystemManager.resolveFile(DefaultFileSystemManager.java:621)
at org.cloudifysource.esc.installer.filetransfer.VfsFileTransfer.resolveTargetDirectory(VfsFileTransfer.java:218)
at org.cloudifysource.esc.installer.filetransfer.VfsFileTransfer.initialize(VfsFileTransfer.java:203)
... 8 more
Caused
by: org.apache.commons.vfs2.FileSystemException: Could not determine
the type of file "sftp://cirros#192.168.10.241/cirros/gs-files".
at org.apache.commons.vfs2.provider.AbstractFileObject.getType(AbstractFileObject.java:505)
at org.apache.commons.vfs2.provider.sftp.SftpFileObject.refresh(SftpFileObject.java:91)
... 16 more
Caused by: org.apache.commons.vfs2.FileSystemException: Could not connect to SFTP server at "sftp://cirros#192.168.10.241/".
at org.apache.commons.vfs2.provider.sftp.SftpFileSystem.getChannel(SftpFileSystem.java:153)
at org.apache.commons.vfs2.provider.sftp.SftpFileObject.statSelf(SftpFileObject.java:151)
at org.apache.commons.vfs2.provider.sftp.SftpFileObject.doGetType(SftpFileObject.java:114)
at org.apache.commons.vfs2.provider.AbstractFileObject.getType(AbstractFileObject.java:496)
... 17 more
Caused by: com.jcraft.jsch.JSchException: java.io.IOException: Pipe closed
at com.jcraft.jsch.ChannelSftp.start(ChannelSftp.java:288)
at com.jcraft.jsch.Channel.connect(Channel.java:152)
at com.jcraft.jsch.Channel.connect(Channel.java:145)
at org.apache.commons.vfs2.provider.sftp.SftpFileSystem.getChannel(SftpFileSystem.java:130)
... 20 more
Caused by: java.io.IOException: Pipe closed
at java.io.PipedInputStream.read(PipedInputStream.java:308)
at java.io.PipedInputStream.read(PipedInputStream.java:378)
at com.jcraft.jsch.ChannelSftp.fill(ChannelSftp.java:2665)
at com.jcraft.jsch.ChannelSftp.header(ChannelSftp.java:2691)
at com.jcraft.jsch.ChannelSftp.start(ChannelSftp.java:257)
... 23 more
Looks like you are trying to sftp into a cirros instance - I am not sure cirros even supports sftp. You can try this by using the sftp command line utility.
In general, sftp has to be configured and available on the target machine.
You can try using the SCP file transfer mode by setting this in your compute template:
fileTransfer org.cloudifysource.domain.cloud.FileTransferModes.SCP
If you are really using cirros, I suspect bootstrapping will fail. Cloudify was never tested on cirros. I think cirros is lacking some very basic utilities (I think it is not running bash. Not sure if it has wget). Cirros was never meant as a generic distribution - it is meant for testing your cloud's basic functionality.
One more thing - Cloudify 2 has reached End-of-Life - it is no longer supported. You should check out Cloudify 3.

Passing hostname to netty

Background: I've got two machines with identical hostnames, I need to set up a local spark cluster for testing, setting up a master and a worker works fine, but trying to run an application with the driver causes problems, netty doesn't seem to be picking the correct host (regardless of what I put in there, it just picks the first host).
Identical hostname:
$ dig +short corehost
192.168.0.100
192.168.0.101
Spark config (used by master and the local worker):
export SPARK_LOCAL_DIRS=/some/dir
export SPARK_LOCAL_IP=corehost // i tried various like 192.168.0.x for
export SPARK_MASTER_IP=corehost // local, master and the driver
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_CORES=2
export SPARK_WORKER_MEMORY=2g
export SPARK_WORKER_INSTANCES=2
export SPARK_WORKER_DIR=/some/dir
Spark starts up and I can see the worker in the web-ui.
When I run the spark "job" below:
val conf = new SparkConf().setAppName("AaA")
// tried 192.168.0.x and localhost
.setMaster("spark://corehost:7077")
val sc = new SparkContext(conf)
I get this exception:
15/04/02 12:34:04 INFO SparkContext: Running Spark version 1.3.0
15/04/02 12:34:04 WARN Utils: Your hostname, corehost resolves to a loopback address: 127.0.0.1; using 192.168.0.100 instead (on interface en1)
15/04/02 12:34:04 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/04/02 12:34:05 ERROR NettyTransport: failed to bind to corehost.home/192.168.0.101:0, shutting down Netty transport
...
Exception in thread "main" java.net.BindException: Failed to bind to: corehost.home/192.168.0.101:0: Service 'sparkDriver' failed after 16 retries!
at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
at akka.remote.transport.netty.NettyTransport$$anonfun$listen$1.apply(NettyTransport.scala:393)
at akka.remote.transport.netty.NettyTransport$$anonfun$listen$1.apply(NettyTransport.scala:389)
at scala.util.Success$$anonfun$map$1.apply(Try.scala:206)
at scala.util.Try$.apply(Try.scala:161)
at scala.util.Success.map(Try.scala:206)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
15/04/02 12:34:05 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
15/04/02 12:34:05 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
15/04/02 12:34:05 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
Process finished with exit code 1
Not sure how to proceed... its a whole jungle of ip addresses.
Not sure if this is a netty issue either.
My experience with the identical problem is that it revolves around setting things up locally. Try being more verbose in your spark driver code, add the SPARK_LOCAL_IP and driver host ip to the config:
val conf = new SparkConf().setAppName("AaA")
.setMaster("spark://localhost:7077")
.set("spark.local.ip","192.168.1.100")
.set("spark.driver.host","192.168.1.100")
This should tell netty which of the two identical hosts to use.

Resources