I have tried the tdimport utility to copy data from Teradata into Hadoop, but it fails with: "15/12/11 07:21:28 INFO processor.HiveOutputProcessor: hive table default.pos_rtl_str_test does not exist"
How can I specify the Hive schema name here?
[biadmin#ehaasp-10035-master-3]/usr/iop/4.1.0.0/hive/bin>/usr/iop/4.1.0.0/sqoop/bin/sqoop tdimport --connect jdbc:teradata://<<ipaddress>>/database=EDW01_V_LV_BASE --username <<username>> --password <<password>> --as-textfile --hive-table pos_rtl_str_test --table pos_rtl_str --columns "RTL_STR_ID, RTL_STR_LANG_CD" --split-by RTL_STR_ID
Warning: /usr/iop/4.1.0.0/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
15/12/11 07:20:47 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6_IBM_20
15/12/11 07:20:47 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
15/12/11 07:20:47 INFO common.ConnectorPlugin: load plugins in jar:file:/usr/iop/4.1.0.0/sqoop/lib/teradata-connector-1.4.1.jar!/teradata.connector.plugins.xml
15/12/11 07:20:47 WARN conf.HiveConf: HiveConf of name hive.heapsize does not exist
15/12/11 07:20:47 INFO hive.metastore: Trying to connect to metastore with URI thrift://<<masternode dns name>>:9083
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/iop/4.1.0.0/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/iop/4.1.0.0/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
15/12/11 07:20:48 INFO hive.metastore: Connected to metastore.
15/12/11 07:20:48 INFO processor.TeradataInputProcessor: input preprocessor com.teradata.connector.teradata.processor.TeradataSplitByHashProcessor starts at: 1449818448723
15/12/11 07:20:49 INFO utils.TeradataUtils: the input database product is Teradata
15/12/11 07:20:49 INFO utils.TeradataUtils: the input database version is 14.10
15/12/11 07:20:49 INFO utils.TeradataUtils: the jdbc driver version is 15.0
15/12/11 07:21:07 INFO processor.TeradataInputProcessor: the teradata connector for hadoop version is: 1.4.1
15/12/11 07:21:07 INFO processor.TeradataInputProcessor: input jdbc properties are jdbc:teradata://<<ipaddress>>/database=<<database>>
15/12/11 07:21:27 INFO processor.TeradataInputProcessor: the number of mappers are 4
15/12/11 07:21:27 INFO processor.TeradataInputProcessor: input preprocessor com.teradata.connector.teradata.processor.TeradataSplitByHashProcessor ends at: 1449818487899
15/12/11 07:21:27 INFO processor.TeradataInputProcessor: the total elapsed time of input preprocessor com.teradata.connector.teradata.processor.TeradataSplitByHashProcessor is: 39s
15/12/11 07:21:28 WARN conf.HiveConf: HiveConf of name hive.heapsize does not exist
15/12/11 07:21:28 INFO hive.metastore: Trying to connect to metastore with URI thrift://ehaasp-10035-master-3.bi.services.bluemix.net:9083
15/12/11 07:21:28 INFO hive.metastore: Connected to metastore.
15/12/11 07:21:28 INFO processor.HiveOutputProcessor: hive table default.pos_rtl_str_test does not exist
15/12/11 07:21:28 WARN tool.ConnectorJobRunner: com.teradata.connector.common.exception.ConnectorException: The output post processor returns 1
15/12/11 07:21:28 INFO processor.TeradataInputProcessor: input postprocessor com.teradata.connector.teradata.processor.TeradataSplitByHashProcessor starts at: 1449818488581
15/12/11 07:21:28 INFO processor.TeradataInputProcessor: input postprocessor com.teradata.connector.teradata.processor.TeradataSplitByHashProcessor ends at: 1449818488581
15/12/11 07:21:28 INFO processor.TeradataInputProcessor: the total elapsed time of input postprocessor com.teradata.connector.teradata.processor.TeradataSplitByHashProcessor is: 0s
15/12/11 07:21:28 ERROR wrapper.TDImportTool: Teradata Connector for Hadoop tool error.
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.ibm.biginsights.ie.sqoop.td.wrapper.TDImportTool.callTDCH(TDImportTool.java:104)
at com.ibm.biginsights.ie.sqoop.td.wrapper.TDImportTool.run(TDImportTool.java:72)
at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
at org.apache.sqoop.Sqoop.main(Sqoop.java:236)
Caused by: com.teradata.connector.common.exception.ConnectorException: Import Hive table's column schema is missing
at com.teradata.connector.common.tool.ConnectorJobRunner.runJob(ConnectorJobRunner.java:140)
... 12 more
Have you tried fully qualifying the location of your Hive table? The error suggests that the default database does not contain the schema for the table.
tdimport --connect jdbc:teradata://<<ipaddress>>/database=EDW01_V_LV_BASE
--username <<username>> --password <<password>>
--as-textfile --hive-table {hivedb}.pos_rtl_str_test
--table pos_rtl_str --columns "RTL_STR_ID, RTL_STR_LANG_CD"
--split-by RTL_STR_ID
If the table does not exist and you are trying to create it at the same time, you need to include the --map-column-hive parameter.
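If you are creating the Hive table as part of the import, a hedged sketch of the full command might look like the following; the INT and STRING Hive types for RTL_STR_ID and RTL_STR_LANG_CD are assumptions, so substitute whatever matches your Teradata column types:
# NOTE: the INT/STRING Hive types below are assumed -- adjust them to your Teradata schema
tdimport --connect jdbc:teradata://<<ipaddress>>/database=EDW01_V_LV_BASE \
    --username <<username>> --password <<password>> \
    --as-textfile --hive-table {hivedb}.pos_rtl_str_test \
    --table pos_rtl_str --columns "RTL_STR_ID, RTL_STR_LANG_CD" \
    --split-by RTL_STR_ID \
    --map-column-hive RTL_STR_ID=INT,RTL_STR_LANG_CD=STRING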
When trying to start Yagna I receive this error. What can I do? I can probably get some DEBUG logs if needed.
[2021-05-06T08:45:08Z INFO yagna] Starting yagna service! Version: 0.6.4 (4fc72117 2021-04-15 build #135).
Log is written to /home/user/.local/share/yagna/yagna_rCURRENT.log
[2021-05-06T08:45:08Z INFO yagna] Data directory: /home/user/.local/share/yagna
[2021-05-06T08:45:08Z INFO ya_sb_router::unix] Router listening on: "/tmp/yagna.sock"
[2021-05-06T08:45:08Z INFO ya_persistence::executor] using database at: /home/user/.local/share/yagna/yagna.db
[2021-05-06T08:45:08Z INFO ya_persistence::executor] using database at: /home/user/.local/share/yagna/market.db
[2021-05-06T08:45:08Z INFO ya_persistence::executor] using database at: /home/user/.local/share/yagna/activity.db
[2021-05-06T08:45:08Z INFO ya_persistence::executor] using database at: /home/user/.local/share/yagna/payment.db
[2021-05-06T08:45:08Z INFO ya_identity::service::identity] using default identity: 0xf5ecffecf053508fe97255e046a04ce21c8ee525
[2021-05-06T08:45:08Z INFO yagna] Identity GSB service successfully activated
[2021-05-06T08:45:08Z INFO ya_metrics::pusher] Metrics pusher started
[2021-05-06T08:45:08Z INFO yagna] Metrics GSB service successfully activated
[2021-05-06T08:45:08Z INFO ya_service_bus::remote_router] trying to connect to: /tmp/yagna.sock
[2021-05-06T08:45:08Z INFO ya_service_bus::connection] started connection to gsb
[2021-05-06T08:45:08Z INFO ya_metrics::pusher] Starting metrics pusher
[2021-05-06T08:45:10Z INFO yagna] Version GSB service successfully activated
[2021-05-06T08:45:10Z INFO ya_net::service] using default identity as network id: 0xf5ecffecf053508fe97255e046a04ce21c8ee525
[2021-05-06T08:45:10Z WARN ya_net::handler] Failed to bind handlers: DNS Error: Not Implemented; retrying in 2 s
[2021-05-06T08:45:12Z WARN ya_net::handler] Failed to bind handlers: DNS Error: Not Implemented; retrying in 4 s
[2021-05-06T08:45:16Z WARN ya_net::handler] Failed to bind handlers: DNS Error: Not Implemented; retrying in 8 s
EDIT: nslookup
Server: 10.139.1.1
Address: 10.139.1.1#53
** server can't find _net._tcp.dev.golem.network: NOTIMP
I'm not sure what the reason is here, but it seems that DNS is not able to resolve the _net._tcp.dev.golem.network SRV record, yielding 'Not Implemented'. That is very odd, since Yagna uses Google's DNS servers by default.
When you face this again, please check the output of the following command:
nslookup -q=SRV _net._tcp.dev.golem.network 8.8.8.8
The user has trouble reaching Google's DNS with nslookup, so it appears to be something on their end. They are also using a proxy for their connection, so the problem most likely occurs somewhere in there. Closing thread.
I'm trying to connect from R to a remote Spark cluster.
The Spark cluster is built on Debian Jessie, and the R version I can install on it is at most 3.3, but I need 3.4 to be able to run FactoMineR. So I installed R on another machine and am trying to connect to the cluster using sparklyr 0.8.4:
> sc <- spark_connect(master = "spark://spark-cluster-m:7077", spark_home="/usr/lib/spark/", version="2.2.1")
Error in start_shell(master = master, spark_home = spark_home, spark_version = version, :
SPARK_HOME directory '/usr/lib/spark/' not found
Spark isn't installed on the local machine, but it is on spark-cluster-m:
jc#spark-cluster-m:/usr/lib/spark$ ls
bin conf data examples external jars LICENSE licenses NOTICE python R README.md RELEASE sbin work yarn
Have I missed something?
The Spark cluster is on Google Cloud (test account), and so is the VM with R. How do I verify which port Spark can be connected to?
Thanks for your clues.
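One quick way to check port reachability from the R machine is a minimal netcat probe (a sketch, assuming nc is installed; on Google Cloud a firewall rule must also allow the port):
# From the VM running R, test whether the standalone master port answers:
nc -vz spark-cluster-m 7077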
#user16... You're right, this particular problem seems to be solved, but I'm not done yet.
I installed the same Spark version (2.2.1 with Hadoop > 2.7).
Here is my new error message:
Error in force(code) :
Failed during initialize_connection: java.lang.IllegalArgumentException: requirement failed: Can only call getServletHandlers on a running MetricsSystem
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.metrics.MetricsSystem.getServletHandlers(MetricsSystem.scala:91)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:524)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2516)
at org.apache.spark.SparkContext.getOrCreate(SparkContext.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at sparklyr.Invoke.invoke(invoke.scala:137)
at sparklyr.StreamHandler.handleMethodCall(stream.scala:123)
at sparklyr.StreamHandler.read(stream.scala:66)
at sparklyr.BackendHandler.channelRead0(handler.scala:51)
at sparklyr.BackendHandler.channelRead0(handler.scala:4)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:267)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:748)
Log: /tmp/RtmpTUh0z6/file5d231368db0_spark.log
---- Output Log ----
at io.netty.channel.nio.NioEventLoop.processS
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
... 1 more
18/07/21 18:24:59 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://spark-cluster-m:7077...
18/07/21 18:24:59 WARN StandaloneAppClient$ClientEndpoint: Failed to connect to master spark-cluster-m:7077
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:100)
at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:108)
at org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1$$anon$1.run(StandaloneAppClient.scala:106)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Failed to connect to spark-cluster-m/10.142.0.3:7077
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
... 4 more
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: spark-cluster-m/10.142.0.3:7077
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:257)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:631)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
... 1 more
18/07/21 18:25:19 ERROR StandaloneSchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
18/07/21 18:25:19 WARN StandaloneSchedulerBackend: Application ID is not initialized yet.
18/07/21 18:25:19 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 46811.
18/07/21 18:25:19 INFO NettyBlockTransferService: Server created on 10.142.0.5:46811
18/07/21 18:25:19 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
18/07/21 18:25:19 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.142.0.5, 46811, None)
18/07/21 18:25:19 INFO BlockManagerMasterEndpoint: Registering block manager 10.142.0.5:46811 with 366.3 MB RAM, BlockManagerId(driver, 10.142.0.5, 46811, None)
18/07/21 18:25:19 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.142.0.5, 46811, None)
18/07/21 18:25:19 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.142.0.5, 46811, None)
18/07/21 18:25:19 INFO SparkUI: Stopped Spark web UI at http://10.142.0.5:4040
18/07/21 18:25:19 INFO StandaloneSchedulerBackend: Shutting down all executors
18/07/21 18:25:19 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
18/07/21 18:25:19 WARN StandaloneAppClient$ClientEndpoint: Drop Unregist
I can see it can resolve the name (=> 10.142.0.3).
Also, it seems to be the right port, because if I use port 7000 instead, I get this error:
18/07/21 18:32:54 ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from spark-cluster-m/10.142.0.3:7000 is closed
18/07/21 18:32:54 WARN StandaloneAppClient$ClientEndpoint: Could not connect to spark-cluster-m:7000: java.io.IOException: Connection reset by peer
18/07/21 18:32:54 WARN StandaloneAppClient$ClientEndpoint: Failed to connect to master spark-cluster-m:7000
But I can't figure out what this means.
You say my configuration is "particular". If there is a better (and simpler) approach, I would be glad to use it.
Here is how I proceeded in my tests:
I created a Google Dataproc cluster with Spark (2.2.1).
I added Cassandra on each node.
At this stage, everything works OK.
Then I need to install FactoMineR, as I'd like to try HMFA. It is said to run with R > 3.0.0, so that seems fine, but it depends on nlme, which can't be installed on R < 3.4.0 (and the one in the Debian Jessie backports is 3.3).
So, what can I do?
I must admit that I'm not very enthusiastic about redoing a full Spark / Cassandra cluster install from scratch...
I am not able to start nodes on a Linux server.
I am getting the following (edited) output:
[user#host nodes]$ ./runnodes
which: no osascript in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin:/home/user/.local/bin:/home/user/bin)
Starting nodes in /opt/nodes
Starting corda.jar in /opt/nodes/NodeA on debug port 5005
Starting corda-webserver.jar in /opt/nodes/Agent on debug port 5006
Starting corda.jar in /opt/nodes/NodeB on debug port 5007
Starting corda-webserver.jar in /opt/nodes/NodeB on debug port 5008
Starting corda.jar in /opt/nodes/NodeC on debug port 5009
Starting corda-webserver.jar in /opt/nodes/NodeC on debug port 5010
Starting corda.jar in /opt/nodes/NodeZ on debug port 5011
Starting corda-webserver.jar in /opt/nodes/NodeZ on debug port 5012
Started 8 processes
Finished starting nodes
[user#host nodes]$ Error opening zip file or JAR manifest missing : /home/user/.capsule/apps/net.corda.webserver.WebServer_0.12.1/quasar-core-0.7.6-jdk8.jar
Error occurred during initialization of VM
agent library failed to init: instrument
Error opening zip file or JAR manifest missing : /home/user/.capsule/apps/net.corda.node.Corda_0.12.1/quasar-core-0.7.6-jdk8.jar
Error occurred during initialization of VM
agent library failed to init: instrument
Error opening zip file or JAR manifest missing : /home/user/.capsule/apps/net.corda.webserver.WebServer_0.12.1/quasar-core-0.7.6-jdk8.jar
Error occurred during initialization of VM
agent library failed to init: instrument
Error opening zip file or JAR manifest missing : /home/user/.capsule/apps/net.corda.node.Corda_0.12.1/quasar-core-0.7.6-jdk8.jar
Error occurred during initialization of VM
agent library failed to init: instrument
Error opening zip file or JAR manifest missing : /home/user/.capsule/apps/net.corda.webserver.WebServer_0.12.1/quasar-core-0.7.6-jdk8.jar
Error occurred during initialization of VM
agent library failed to init: instrument
Error opening zip file or JAR manifest missing : /home/user/.capsule/apps/net.corda.node.Corda_0.12.1/quasar-core-0.7.6-jdk8.jar
Error occurred during initialization of VM
agent library failed to init: instrument
Listening for transport dt_socket at address: 5012
Unknown command line arguments: no-local-shell is not a recognized option
Listening for transport dt_socket at address: 5011
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/user/.capsule/apps/net.corda.node.Corda_0.12.1/log4j-slf4j-impl-2.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/nodes/NodeZ/dependencies/log4j-slf4j-impl-2.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
What am I missing to start these nodes?
Did you build the nodes on a Mac, then transfer them to Linux? If so, try building the nodes directly on the Linux machine.
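A minimal sketch of rebuilding on Linux, assuming a standard CorDapp project with the deployNodes Gradle task configured in build.gradle:
# From the CorDapp project root on the Linux machine:
./gradlew clean deployNodes
# deployNodes regenerates the nodes (and a fresh runnodes script) under build/nodes:
cd build/nodes
./runnodes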
I have a problem on a single-node Hadoop POC environment (Ubuntu 14.04) when running R and connecting to Spark through SparkR 1.5. I ran this test a couple of times before and had no issues with it until today.
My goal is to use SparkR to connect to Hive and bring in a table (and ultimately to write data frame results back to Hive). This is the output from the R console in RStudio. I am totally stumped, and any advice is appreciated.
library(SparkR, lib.loc="/usr/hdp/2.3.6.0-3796/spark/R/lib/")
sc <- sparkR.init(sparkHome = "/usr/hdp/2.3.6.0-3796/spark/")
Launching java with spark-submit command /usr/hdp/2.3.6.0-3796/spark//bin/spark-submit sparkr-shell /tmp/RtmpdGojW1/backend_portb8b949c8f0e2
17/08/15 15:50:18 WARN SparkConf: The configuration key 'spark.yarn.applicationMaster.waitTries' has been deprecated as of Spark 1.3 and and may be removed in the future. Please use the new key 'spark.yarn.am.waitTime' instead.
17/08/15 15:50:19 WARN SparkConf: The configuration key 'spark.yarn.applicationMaster.waitTries' has been deprecated as of Spark 1.3 and and may be removed in the future. Please use the new key 'spark.yarn.am.waitTime' instead.
17/08/15 15:50:19 INFO SparkContext: Running Spark version 1.5.2
17/08/15 15:50:20 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/08/15 15:50:20 WARN SparkConf: The configuration key 'spark.yarn.applicationMaster.waitTries' has been deprecated as of Spark 1.3 and and may be removed in the future. Please use the new key 'spark.yarn.am.waitTime' instead.
17/08/15 15:50:20 WARN Utils: Your hostname, localhost resolves to a loopback address: 127.0.0.1; using 10.100.0.11 instead (on interface eth0)
17/08/15 15:50:20 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
17/08/15 15:50:20 INFO SecurityManager: Changing view acls to: rstudio
17/08/15 15:50:20 INFO SecurityManager: Changing modify acls to: rstudio
17/08/15 15:50:20 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(rstudio); users with modify permissions: Set(rstudio)
17/08/15 15:50:22 INFO Slf4jLogger: Slf4jLogger started
17/08/15 15:50:22 INFO Remoting: Starting remoting
17/08/15 15:50:23 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver#10.100.0.11:43827]
17/08/15 15:50:23 INFO Utils: Successfully started service 'sparkDriver' on port 43827.
17/08/15 15:50:23 INFO SparkEnv: Registering MapOutputTracker
17/08/15 15:50:23 WARN SparkConf: The configuration key 'spark.yarn.applicationMaster.waitTries' has been deprecated as of Spark 1.3 and and may be removed in the future. Please use the new key 'spark.yarn.am.waitTime' instead.
17/08/15 15:50:23 WARN SparkConf: The configuration key 'spark.yarn.applicationMaster.waitTries' has been deprecated as of Spark 1.3 and and may be removed in the future. Please use the new key 'spark.yarn.am.waitTime' instead.
17/08/15 15:50:23 INFO SparkEnv: Registering BlockManagerMaster
17/08/15 15:50:23 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-bea658dc-145f-48a6-bb28-6f05af529547
17/08/15 15:50:23 INFO MemoryStore: MemoryStore started with capacity 530.0 MB
17/08/15 15:50:23 WARN SparkConf: The configuration key 'spark.yarn.applicationMaster.waitTries' has been deprecated as of Spark 1.3 and and may be removed in the future. Please use the new key 'spark.yarn.am.waitTime' instead.
17/08/15 15:50:23 INFO HttpFileServer: HTTP File server directory is /tmp/spark-6b719b9d-3d54-48bc-8894-cd2ddf9b0755/httpd-e7371ee1-5574-476d-9d53-679a9781af2d
17/08/15 15:50:23 INFO HttpServer: Starting HTTP Server
17/08/15 15:50:23 INFO Server: jetty-8.y.z-SNAPSHOT
17/08/15 15:50:23 INFO AbstractConnector: Started SocketConnector#0.0.0.0:39275
17/08/15 15:50:23 INFO Utils: Successfully started service 'HTTP file server' on port 39275.
17/08/15 15:50:23 INFO SparkEnv: Registering OutputCommitCoordinator
17/08/15 15:50:23 INFO Server: jetty-8.y.z-SNAPSHOT
17/08/15 15:50:24 INFO AbstractConnector: Started SelectChannelConnector#0.0.0.0:4040
17/08/15 15:50:24 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/08/15 15:50:24 INFO SparkUI: Started SparkUI at http://10.100.0.11:4040
17/08/15 15:50:24 WARN SparkConf: The configuration key 'spark.yarn.applicationMaster.waitTries' has been deprecated as of Spark 1.3 and and may be removed in the future. Please use the new key 'spark.yarn.am.waitTime' instead.
17/08/15 15:50:24 WARN SparkConf: The configuration key 'spark.yarn.applicationMaster.waitTries' has been deprecated as of Spark 1.3 and and may be removed in the future. Please use the new key 'spark.yarn.am.waitTime' instead.
17/08/15 15:50:24 WARN SparkConf: The configuration key 'spark.yarn.applicationMaster.waitTries' has been deprecated as of Spark 1.3 and and may be removed in the future. Please use the new key 'spark.yarn.am.waitTime' instead.
17/08/15 15:50:24 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
17/08/15 15:50:24 INFO Executor: Starting executor ID driver on host localhost
17/08/15 15:50:24 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 43075.
17/08/15 15:50:24 INFO NettyBlockTransferService: Server created on 43075
17/08/15 15:50:24 INFO BlockManagerMaster: Trying to register BlockManager
17/08/15 15:50:24 INFO BlockManagerMasterEndpoint: Registering block manager localhost:43075 with 530.0 MB RAM, BlockManagerId(driver, localhost, 43075)
17/08/15 15:50:24 INFO BlockManagerMaster: Registered BlockManager
hiveContext <- sparkRHive.init(sc)
17/08/15 15:51:17 WARN SparkConf: The configuration key 'spark.yarn.applicationMaster.waitTries' has been deprecated as of Spark 1.3 and and may be removed in the future. Please use the new key 'spark.yarn.am.waitTime' instead.
17/08/15 15:51:19 INFO HiveContext: Initializing execution hive, version 1.2.1
17/08/15 15:51:19 INFO ClientWrapper: Inspected Hadoop version: 2.7.1.2.3.6.0-3796
17/08/15 15:51:19 INFO ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.7.1.2.3.6.0-3796
17/08/15 15:51:19 WARN SparkConf: The configuration key 'spark.yarn.applicationMaster.waitTries' has been deprecated as of Spark 1.3 and and may be removed in the future. Please use the new key 'spark.yarn.am.waitTime' instead.
17/08/15 15:51:20 INFO metastore: Trying to connect to metastore with URI thrift://localhost.localdomain:9083
17/08/15 15:51:20 INFO metastore: Connected to metastore.
17/08/15 15:51:21 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
17/08/15 15:51:22 INFO SessionState: Created local directory: /tmp/a4f76c27-cf73-45bf-b873-a0e97ca43309_resources
17/08/15 15:51:22 INFO SessionState: Created HDFS directory: /tmp/hive/rstudio/a4f76c27-cf73-45bf-b873-a0e97ca43309
17/08/15 15:51:22 INFO SessionState: Created local directory: /tmp/rstudio/a4f76c27-cf73-45bf-b873-a0e97ca43309
17/08/15 15:51:22 INFO SessionState: Created HDFS directory: /tmp/hive/rstudio/a4f76c27-cf73-45bf-b873-a0e97ca43309/_tmp_space.db
17/08/15 15:51:22 INFO HiveContext: default warehouse location is /user/hive/warehouse
17/08/15 15:51:22 INFO HiveContext: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.
17/08/15 15:51:22 INFO ClientWrapper: Inspected Hadoop version: 2.7.1.2.3.6.0-3796
17/08/15 15:51:22 INFO ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.7.1.2.3.6.0-3796
17/08/15 15:51:22 WARN SparkConf: The configuration key 'spark.yarn.applicationMaster.waitTries' has been deprecated as of Spark 1.3 and and may be removed in the future. Please use the new key 'spark.yarn.am.waitTime' instead.
17/08/15 15:51:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/08/15 15:51:25 INFO metastore: Trying to connect to metastore with URI thrift://localhost.localdomain:9083
17/08/15 15:51:25 INFO metastore: Connected to metastore.
17/08/15 15:51:27 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
17/08/15 15:51:27 INFO SessionState: Created local directory: /tmp/16b5f51f-f570-4fc0-b3a6-eda3edd19b59_resources
17/08/15 15:51:27 INFO SessionState: Created HDFS directory: /tmp/hive/rstudio/16b5f51f-f570-4fc0-b3a6-eda3edd19b59
17/08/15 15:51:27 INFO SessionState: Created local directory: /tmp/rstudio/16b5f51f-f570-4fc0-b3a6-eda3edd19b59
17/08/15 15:51:27 INFO SessionState: Created HDFS directory: /tmp/hive/rstudio/16b5f51f-f570-4fc0-b3a6-eda3edd19b59/_tmp_space.db
showDF(sql(hiveContext, "USE MyHiveDB"))
Error: is.character(x) is not TRUE
showDF(sql(hiveContext, "SELECT * FROM table"))
Error: is.character(x) is not TRUE
Solved. The issue here is exactly what cricket_007 suggested with the Databricks link: some packages loaded in the R session conflicted with the SparkR instance.
Detaching them from the current R session resolved the issue and got the code to work.
The packages to detach were:
plyr
dplyr
dbplyr
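A minimal sketch of the detach calls, assuming the three packages were attached with library() in the same session:
# Detach the conflicting packages (and unload their namespaces) before using SparkR,
# so that the functions they mask no longer shadow the SparkR API:
detach("package:dbplyr", unload = TRUE)
detach("package:dplyr", unload = TRUE)
detach("package:plyr", unload = TRUE)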
I have an R script which works perfectly fine in the R console, but when I run it with Hadoop Streaming it fails with the error below in the map phase. Here are the task attempt logs.
The Hadoop Streaming command I have:
/home/Bibhu/hadoop-0.20.2/bin/hadoop jar \
/home/Bibhu/hadoop-0.20.2/contrib/streaming/*.jar \
-input hdfs://localhost:54310/user/Bibhu/BookTE1.csv \
-output outsid -mapper `pwd`/code1.sh
stderr logs
Loading required package: class
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
no lines available in input
Calls: read.csv -> read.table
Execution halted
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:572)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
syslog logs
2013-07-03 19:32:36,080 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2013-07-03 19:32:36,654 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 1
2013-07-03 19:32:36,675 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100
2013-07-03 19:32:36,835 INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720
2013-07-03 19:32:36,835 INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680
2013-07-03 19:32:36,899 INFO org.apache.hadoop.streaming.PipeMapRed: PipeMapRed exec [/home/Bibhu/Downloads/SentimentAnalysis/Sid/smallFile/code1.sh]
2013-07-03 19:32:37,256 INFO org.apache.hadoop.streaming.PipeMapRed: Records R/W=0/1
2013-07-03 19:32:38,509 INFO org.apache.hadoop.streaming.PipeMapRed: MRErrorThread done
2013-07-03 19:32:38,509 INFO org.apache.hadoop.streaming.PipeMapRed: PipeMapRed failed!
2013-07-03 19:32:38,557 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:572)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
2013-07-03 19:32:38,631 INFO org.apache.hadoop.mapred.TaskRunner: Runnning cleanup for the task
Write the Hadoop Streaming jar with its full version, e.g. hadoop-streaming-1.0.4.jar.
Specify a separate file path for the mapper and the reducer with the -file option.
Tell Hadoop which scripts are your mapper and reducer code with the -mapper and -reducer options.
For more reference, see Running WordCount on Hadoop using R script; a sketch of the adjusted command follows.
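A hedged sketch of the adjusted command, based on the original invocation above (the streaming jar name depends on your Hadoop build, so check contrib/streaming/; reducer.sh is a hypothetical reducer script, and -reducer NONE can be used instead for a map-only job):
# reducer.sh below is hypothetical -- replace it with your own reducer, or use -reducer NONE
/home/Bibhu/hadoop-0.20.2/bin/hadoop jar \
    /home/Bibhu/hadoop-0.20.2/contrib/streaming/hadoop-0.20.2-streaming.jar \
    -input hdfs://localhost:54310/user/Bibhu/BookTE1.csv \
    -output outsid \
    -file `pwd`/code1.sh -mapper code1.sh \
    -file `pwd`/reducer.sh -reducer reducer.sh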
You need to find the logs from your mappers and reducers, since this is the place where the job is failing (as indicated by java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1). This says that your R script crashed.
If you are using the Hortonworks Hadoop distribution, the easiest way is to open your job history. It should be at http://127.0.0.1:19888/jobhistory . It should be possible to find the log in the filesystem using the command line as well, but I haven't yet found where.
Open http://127.0.0.1:19888/jobhistory in your web browser
Click on the Job ID of the failed job
Click the number indicating the failed job count
Click an attempt link
Click the logs link
You should see a page which looks something like
Log Type: stderr
Log Length: 418
Traceback (most recent call last):
File "/hadoop/yarn/local/usercache/root/appcache/application_1404203309115_0003/container_1404203309115_0003_01_000002/./mapper.py", line 45, in <module>
mapper()
File "/hadoop/yarn/local/usercache/root/appcache/application_1404203309115_0003/container_1404203309115_0003_01_000002/./mapper.py", line 37, in mapper
for record in reader:
_csv.Error: newline inside string
This is an error from my Python script; the errors from R look a bit different.
source: http://hortonworks.com/community/forums/topic/map-reduce-job-log-files/
I received this same error tonight, while also developing MapReduce Streaming jobs with R.
I was working on a 10-node cluster, each node with 12 cores, and tried to supply these options at submission time:
-D mapred.map.tasks=200\
-D mapred.reduce.tasks=200
The job completed successfully, though, when I changed these to:
-D mapred.map.tasks=10\
-D mapred.reduce.tasks=10
This was a mysterious fix, and perhaps more context will arise this evening. But if any readers can elucidate, please do!