SparkR Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState' - r

I am installing SparkR in my Windows 8.1 from this tutorial https://www.linkedin.com/pulse/setting-up-sparkr-windows-machine-ramabhadran-kapistalam. I ended it so I guess it's well implemented.
The problem is when I try to run an example with a simple Data Frame:
Error in handleErrors(returnStatus, conn) :
java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':
at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:981)
at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110)
at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:62)
at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:552)
at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:307)
at org.apache.spark.sql.api.r.SQLUtils$.createDF(SQLUtils.scala:139)
at org.apache.spark.sql.api.r.SQLUtils.createDF(SQLUtils.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at
This is my code in R:
Sys.setenv(SPARK_HOME = "C:/Spark/spark-2.1.1-bin-hadoop2.7")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"),"R","lib"), .libPaths()))
sparkR.session(appName = "SparkR-DataFrame-example")
df <- as.DataFrame(faithful)
And i saw a solution that I had to configure the sparkr session by adding:
sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "1g", spark.sql.warehouse.dir = "file:///somelocaldirectory"))
I tried to edit the spark.sql.warehouse.dir with a data file but the error persisted

I had exact same problem, starting R-studio or R as administrator solved this issue.
to start as administrator, right click on R-studio or R and select run as administrator and then your commands should work just fine.

Related

How do I register custom JDBC dialect in Rstudio?

I'm trying to analyze bigquery data in Rstudio server running on a google dataproc cluster. However, due to the memory limitations of Rstudio, I intend to run queries on this data in sparklyr but I haven't had any success importing the data directly into the spark cluster from bigquery.
I'm using google's official JDBC connectivity driver:
ODBC and JDBC drivers for BigQuery
I also have the following software versions running:
Google Dataproc: 2.0-Debian 10
Sparklyr: Spark 3.2.1 Hadoop 3.2
R version 4.2.1
I also had to replace the following spark jars with the versions being used by the JDBC connectivity driver above or added them where they were non-existent:
failureaccess-1.0.1 was added
protobuff-java-3.19.4 replaced 2.5.0
guava 31.1-jre replaced 14.0.1
Below is my code using the spark_read_jdbc function to retrieve a dataset from big query
conStr <- "jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;ProjectId=xxxx;OAuthType=3;AllowLargeResults=1;"
spark_read_jdbc(sc = spkc,
name = "events_220210",
memory = FALSE,
options = list(url = conStr,
driver = "com.simba.googlebigquery.jdbc.Driver",
user = "rstudio",
password = "xxxxxx",
dbtable = "dataset.table"))
The table gets imported into the spark cluster but when I try to preview it, the following error message is received
ERROR sparklyr: Gateway (551) failed calling collect on sparklyr.Utils: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4) (faucluster1-w-0.europe-west2-c.c.ga4-warehouse-342410.internal executor 2): java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to long.
at com.simba.googlebigquery.exceptions.ExceptionConverter.toSQLException(Unknown Source)
at com.simba.googlebigquery.utilities.conversion.TypeConverter.toLong(Unknown Source)
at com.simba.googlebigquery.jdbc.common.SForwardResultSet.getLong(Unknown Source)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$9(JdbcUtils.scala:446)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$9$adapted(JdbcUtils.scala:445)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:367)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:349)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:349)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
When I try to import the data via an SQL query, e.g
SELECT date, name, age FROM dataset.tablename
I end up with a table looking like this:
date
name
age
date
name
age
date
name
age
date
name
age
I've read on several posts that the solution to this is to register custom JDBC dialect but I have no idea how to do this; what platform to do it on, or if it's possible to do it within Rstudio. Links to any materials that would help me solve this problem would be appreciated.

Why are some jsonl files failing to load in SparklyR

I am currently trying to read in some jsonl files using SparklyR v 1.3.1 with Spark 2.3.3. While some files read in fine, I am struggling with others, using exactly the same code. Long-ish details below including error messages and packages/code being used.
library(sparklyr)
library(sparklyr.nested)
library(dplyr)
sc <- spark_connect(master = "local")
june1 <- spark_read_json(sc, "june1-aa.jsonl")
june <- spark_read_json(sc, "janetweets_june24.jsonl")
Error: org.apache.spark.sql.AnalysisException: Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column
(named _corrupt_record by default). For example:
spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()
and spark.read.schema(schema).json(file).select("_corrupt_record").show().
Instead, you can cache or save the parsed results and then send the same query.
For example, val df = spark.read.schema(schema).json(file).cache() and then
df.filter($"_corrupt_record".isNotNull).count().;
The first file appears to read in ok, but any attempt to view the file meets with the following error, and "no tables" is displayed in the connections window, as opposed to the file structure for similar files which can be read in fine.
Error in value[[3L]](cond) :
Failed to fetch data: java.lang.NullPointerException
at sparklyr.Collectors$.collectLongArr(collectors.scala:87)
at sparklyr.Collectors$$anonfun$mkColumnCtx$17.apply(collectors.scala:224)
at sparklyr.Collectors$$anonfun$mkColumnCtx$17.apply(collectors.scala:224)
at sparklyr.Collectors$ColumnCtx.collect(collectors.scala:183)
at sparklyr.Utils$.sparklyr$Utils$$collectRows(utils.scala:90)
at sparklyr.Utils$.collect(utils.scala:114)
at sparklyr.Utils.collect(utils.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at sparklyr.Invoke.invoke(invoke.scala:147)
at sparklyr.StreamHandler.handleMethodCall(stream.scala:136)
at sparklyr.StreamHandler.read(stream.scala:61)
at sparklyr.BackendHandler$$anonfun$channelRead0$1.apply$mcV$sp(h
Owing to the size of these files, these are the first lines on each file, as viewed in terminal.
Sample data in pastebin: https://pastebin.com/y3Zevnpv.
I have tried updating my package to the latest version, deleting and reinstalling, and these files are definitely in the jsonlines format. This files were pulled from Twitter using the twarc command line tool from a Windows machine. Pls note: I have removed some URLs from above data owing to Stack Overflow guidelines. Thanks!

Creating data frames in SparkR?

I am new here....so sorry if I ask naive questions !!!
I am using SparkR in Rstudio.
R version 3.3.2
Spark version 2.0.2
I am able to successfully launch Spark in R studio and I can see using webUI. localhost:4040 that my spark is up and running.
But once I try to create data frame it gives error something like this:
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost): java.lang.NullPointerException
at java.lang.ProcessBuilder.start(Unknown Source)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:873)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:474)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:488)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:480)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:7
Can anybody help me with this....Thanks in advance :)
Thank you Guys. I was missing one file, which can be downloaded from git, here is link: https://github.com/steveloughran/winutils/tree/master/hadoop-2.7.1/bin
Actually one of my friends also had same problem, just add this file and it should work fine.

How to read csv into sparkR ver 1.4?

As a new version of spark (1.4) was released there appeared to be a nice frontend interfeace to spark from R package named sparkR. On the documentation page of R for spark there is a command that enables to read json files as an RDD objects
people <- read.df(sqlContext, "./examples/src/main/resources/people.json", "json")
I am trying to read a data from a .csv file like it is described on this revolutionanalitics' blog
# Download the nyc flights dataset as a CSV from https://s3-us-west-2.amazonaws.com/sparkr-data/nycflights13.csv
# Launch SparkR using
# ./bin/sparkR --packages com.databricks:spark-csv_2.10:1.0.3
# The SparkSQL context should already be created for you as sqlContext
sqlContext
# Java ref type org.apache.spark.sql.SQLContext id 1
# Load the flights CSV file using `read.df`. Note that we use the CSV reader Spark package here.
flights <- read.df(sqlContext, "./nycflights13.csv", "com.databricks.spark.csv", header="true")
The note says I need a spark-csv package to enable this operation. So I downloaded this package from this github repo with this command:
$ bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3
But then I encountered such error while trying to read a .csv file.
> flights <- read.df(sqlContext, "./nycflights13.csv", "com.databricks.spark.csv", header="true")
15/07/03 12:52:41 ERROR RBackendHandler: load on 1 failed
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:127)
at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:74)
at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:36)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: Failed to load class for data source: com.databricks.spark.csv
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.sources.ResolvedDataSource$.lookupDataSource(ddl.scala:216)
at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:229)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
at org.apache.spark.sql.SQLContext.load(SQLContext.scala:1230)
... 25 more
Error: returnStatus == 0 is not TRUE
Any idea on what this error means and how to solve this?
Of course I could try to read .csv in a standard way such as:
read.table("data.csv") -> flights
and then I can transform R data.frame into spark's DataFrame like this:
flightsDF <- createDataFrame(sqlContext, flights)
But this isn't the way I like it and it is really time consuming.
You have to start sparkR console each time like this:
sparkR --packages com.databricks:spark-csv_2.10:1.0.3
If you are using Rstudio:
library(SparkR)
Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.0.3" "sparkr-shell"')
sqlContext <- sparkRSQL.init(sc)
does the trick. Make sure the version you specify for spark-csv matches the one you downloaded.
Make sure that you install sparkr from within spark using:
install.packages("C:/spark/R/lib/sparkr.zip", repos = NULL)
and not fro github
that solved it for me.

Accessing Google Storage with SparkR on bdutil deployed cluster

I've been using bdutil for a year now, with hadoop and spark and this is quite perfect!
Now I've got a little problem trying to get SparkR to work with Google Storage as HDFS.
Here is my setup :
- bdutil 1.2.1
- I have deployed a cluster with 1 master and 1 worker with Spark 1.3.0 installed
- Installed R and SparkR on both master and worker
When I run SparkR on master node, I'm trying to point a directory on my GS bucket serveral ways:
1) By setting the gs Filesystem scheme
> file <- textFile(sc, "gs://xxxxx/dir/")
> count(file)
15/05/27 12:02:02 WARN LoadSnappy: Snappy native library is available
15/05/27 12:02:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/05/27 12:02:02 WARN LoadSnappy: Snappy native library not loaded
collect on 5 failed with java.lang.reflect.InvocationTargetException
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at edu.berkeley.cs.amplab.sparkr.SparkRBackendHandler.handleMethodCall(SparkRBackendHandler.scala:111)
at edu.berkeley.cs.amplab.sparkr.SparkRBackendHandler.channelRead0(SparkRBackendHandler.scala:58)
at edu.berkeley.cs.amplab.sparkr.SparkRBackendHandler.channelRead0(SparkRBackendHandler.scala:19)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: No FileSystem for scheme: gs
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1383)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:176)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at edu.berkeley.cs.amplab.sparkr.BaseRRDD.getPartitions(RRDD.scala:31)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1511)
at org.apache.spark.rdd.RDD.collect(RDD.scala:813)
at org.apache.spark.api.java.JavaRDDLike$class.collect(JavaRDDLike.scala:312)
at org.apache.spark.api.java.JavaRDD.collect(JavaRDD.scala:32)
... 25 more
Error: returnStatus == 0 is not TRUE
2) With a HDFS URL
> file <- textFile(sc, "hdfs://hadoop-stage-m:8020/dir/")
> count(file)
collect on 10 failed with java.lang.reflect.InvocationTargetException
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at edu.berkeley.cs.amplab.sparkr.SparkRBackendHandler.handleMethodCall(SparkRBackendHandler.scala:111)
at edu.berkeley.cs.amplab.sparkr.SparkRBackendHandler.channelRead0(SparkRBackendHandler.scala:58)
at edu.berkeley.cs.amplab.sparkr.SparkRBackendHandler.channelRead0(SparkRBackendHandler.scala:19)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://hadoop-stage-m:8020/dir
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at edu.berkeley.cs.amplab.sparkr.BaseRRDD.getPartitions(RRDD.scala:31)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1511)
at org.apache.spark.rdd.RDD.collect(RDD.scala:813)
at org.apache.spark.api.java.JavaRDDLike$class.collect(JavaRDDLike.scala:312)
at org.apache.spark.api.java.JavaRDD.collect(JavaRDD.scala:32)
... 25 more
Error: returnStatus == 0 is not TRUE
3) With a path as I would use with Scala on my other Spark jobs : quite the same error as 2)
I'm sure I'm missing an obvious step. If there is anyone who can help me on this matter, it would be great!
Thanks,
PS: I'm 100% sure that gcs connector is working on a classic Scala job!
Short Answer
You need core-site.xml, hdfs-site.xml, etc., and the gcs-connector-1.3.3-hadoop1.jar on your classpath. Accomplish this with:
export YARN_CONF_DIR=/home/hadoop/hadoop-install/conf:/home/hadoop/hadoop-install/lib/gcs-connector-1.3.3-hadoop1.jar
./sparkR
You may also want other spark-env.sh settings; consider additionally running:
source /home/hadoop/spark-install/conf/spark-env.sh
Before ./sparkR. If you're calling sparkR.init manually in R, then this isn't as necessary since you'll pass params like master directly.
Other possible pitfalls:
Make sure your default Java is Java 7. If it's Java 6, run sudo update-alternatives --config java and select Java 7 as default.
When building sparkR make sure to set Spark version: SPARK_VERSION=1.3.0 ./install-dev.sh
Long Answer
Generally, the "No FileSystem for scheme" error means we need to make sure core-site.xml is on the classpath; a second error I ran into after fixing the classpath was "java.lang.ClassNotFoundException: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem" which means we also need to add gcs-connector-1.3.3.jar to the classpath. Looking through the SparkR helper scripts, the main sparkR binary calls sparkR.init with the following:
sc <- sparkR.init(Sys.getenv("MASTER", unset = ""))
The MASTER environment variable is commonly found in the spark-env.sh script, and indeed bdutil populates the MASTER environment variable under /home/hadoop/spark-install/conf/spark-env.sh. Typically, this should indicate that simply adding source /home/hadoop/spark-install/conf/spark-env.sh should sufficiently populate the necessary settings for SparkR, but if we peek inside the sparkR definition, we see this:
#' Initialize a new Spark Context.
#'
#' This function initializes a new SparkContext.
#'
#' #param master The Spark master URL.
#' #param appName Application name to register with cluster manager
#' #param sparkHome Spark Home directory
#' #param sparkEnvir Named list of environment variables to set on worker nodes.
#' #param sparkExecutorEnv Named list of environment variables to be used when launching executors.
#' #param sparkJars Character string vector of jar files to pass to the worker nodes.
#' #param sparkRLibDir The path where R is installed on the worker nodes.
#' #param sparkRBackendPort The port to use for SparkR JVM Backend.
#' #export
#' #examples
#'\dontrun{
#' sc <- sparkR.init("local[2]", "SparkR", "/home/spark")
#' sc <- sparkR.init("local[2]", "SparkR", "/home/spark",
#' list(spark.executor.memory="1g"))
#' sc <- sparkR.init("yarn-client", "SparkR", "/home/spark",
#' list(spark.executor.memory="1g"),
#' list(LD_LIBRARY_PATH="/directory of JVM libraries (libjvm.so) on workers/"),
#' c("jarfile1.jar","jarfile2.jar"))
#'}
sparkR.init <- function(
master = "",
appName = "SparkR",
sparkHome = Sys.getenv("SPARK_HOME"),
sparkEnvir = list(),
sparkExecutorEnv = list(),
sparkJars = "",
sparkRLibDir = "") {
<...>
cp <- paste0(jars, collapse = collapseChar)
yarn_conf_dir <- Sys.getenv("YARN_CONF_DIR", "")
if (yarn_conf_dir != "") {
cp <- paste(cp, yarn_conf_dir, sep = ":")
}
<...>
if (Sys.getenv("SPARKR_USE_SPARK_SUBMIT", "") == "") {
launchBackend(classPath = cp,
mainClass = "edu.berkeley.cs.amplab.sparkr.SparkRBackend",
args = path,
javaOpts = paste("-Xmx", sparkMem, sep = ""))
} else {
# TODO: We should deprecate sparkJars and ask users to add it to the
# command line (using --jars) which is picked up by SparkSubmit
launchBackendSparkSubmit(
mainClass = "edu.berkeley.cs.amplab.sparkr.SparkRBackend",
args = path,
appJar = .sparkREnv$assemblyJarPath,
sparkHome = sparkHome,
sparkSubmitOpts = Sys.getenv("SPARKR_SUBMIT_ARGS", ""))
}
This tells us three things:
The default sparkR script fails to pass sparkJars, so there doesn't appear to be a current convenient way to pass libjars as flags.
There's a TODO to deprecate the sparkJars param anyways.
Aside from the sparkJars param, the only other thing going into the cp/classPath argument is YARN_CONF_DIR (unless I'm missing some other source of classpath additions, or if I'm using a different version of sparkR than you). Also, fortunately, it appears to use YARN_CONF_DIR even if you're not planning to run on YARN.
In all, this shows you probably want at least the variables in /home/hadoop/spark-install/conf/spark-env.sh since at least some of the hooks appear to look for environment variables commonly defined there, and secondly we should be able to hack YARN_CONF_DIR to specify both the classpath to make it find core-site.xml as well as to add gcs-connector-1.3.3.jar to the classpath.
So, the answer to your question is:
export YARN_CONF_DIR=/home/hadoop/hadoop-install/conf:/home/hadoop/hadoop-install/lib/gcs-connector-1.3.3-hadoop1.jar
./sparkR
You may need to change the /home/hadoop/hadoop-install/lib/gcs-connector-1.3.3-hadoop1.jar part if you're using hadoop2 or some other gcs-connector version. That command fixes both the HDFS access as well as finding the fs.gs.impl for the gcs-connector as well as making sure the actual gcs-connector jar is on the classpath. It doesn't pull in spark-env.sh so you might find it defaulting to running with MASTER=local. You may consider running the following, assuming your worker nodes have also properly installed SparkR:
source /home/hadoop/spark-install/conf/spark-env.sh
export YARN_CONF_DIR=/home/hadoop/hadoop-install/conf:/home/hadoop/hadoop-install/lib/gcs-connector-1.3.3-hadoop1.jar
./sparkR
A couple additional caveats based on what I encountered:
You may find your R installation set an older Java version. If you run into something like "unsupported major.minor version 51.0", run sudo update-alternatives --config java and make Java 7 the default.
If you're using Spark 1.3.0, if you're using SparkR's install-dev.sh, Spark may erroneously hang with "Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory" when in fact dispatchers are fast-failing with serialVersionUID mismatches, which you can see in /hadoop/spark/logs/*Master*.out - the solution is to make sure you run install-dev.sh with the right Spark version set: SPARK_VERSION=1.3.0 ./install-dev.sh

Resources