I have installed the SparkR package and I am able to run other computation jobs such as estimating pi or counting words in a document. But when I try to start a SparkR SQL job, it gives an error. Can anyone help me out?
I am using R version 3.2.0 and Spark 1.3.1
> library(SparkR)
> sc1 <- sparkR.init(master="local")
Launching java with command /usr/lib/jvm/java-7-oracle/bin/java -Xmx1g -cp '/home/himaanshu/R/x86_64-pc-linux-gnu-library/3.2/SparkR/sparkr-assembly-0.1.jar:' edu.berkeley.cs.amplab.sparkr.SparkRBackend /tmp/Rtmp0tAX4W/backend_port614e1c1c38f6
15/07/09 18:05:51 WARN Utils: Your hostname, himaanshu-Inspiron-5520 resolves to a loopback address: 127.0.0.1; using 172.17.42.1 instead (on interface docker0)
15/07/09 18:05:51 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/07/09 18:05:52 INFO Slf4jLogger: Slf4jLogger started
15/07/09 18:05:54 WARN SparkContext: Using SPARK_MEM to set amount of memory to use per executor process is deprecated, please use spark.executor.memory instead.
> sqlContext <- sparkRSQL.init(sc1)
Error: could not find function "sparkRSQL.init"
Your SparkR version is the problem: sparkr-assembly-0.1.jar does not contain sparkRSQL.init yet.
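For reference, a minimal sketch of what the same initialization might look like after upgrading to Spark 1.4.0 or later, where SparkR (including sparkRSQL.init) ships under $SPARK_HOME/R/lib instead of the old sparkr-assembly-0.1.jar; the SPARK_HOME path below is a placeholder:
Sys.setenv(SPARK_HOME = "/path/to/spark-1.4.0-bin-hadoop2.6")  # placeholder path
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)  # available in the SparkR bundled with Spark >= 1.4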
In my environment, I have 2 different versions of Spark (2.2.0 and 1.6.0). I am trying to connect to Spark 1.6.0 from R and I am not able to establish a connection with the guidelines given in the documentation.
I am using:
spark_connect(
  master = "yarn-client",
  config = spark_config(),
  version = "1.6.0",
  spark_home = "/opt/cloudera/parcels/CDH-5.12.1-1.cdh5.12.1.p0.3/lib/spark")
But I am getting the below error:
Error in force(code) :
Failed during initialize_connection: Failed to detect version from SPARK_HOME or SPARK_HOME_VERSION. Try passing the spark version explicitly.
Log: /tmp/RtmplCevTH/file1b51126856258_spark.log
I am able to connect to Spark 2.2.0 without any problem and am able to query the data as well.
Not sure what I am doing wrong.
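One thing worth trying (a sketch, not a confirmed fix): the error message mentions SPARK_HOME_VERSION, so setting that environment variable explicitly before connecting spares sparklyr from detecting the version out of the CDH-packaged SPARK_HOME, which may not contain the version metadata it looks for:
library(sparklyr)
Sys.setenv(SPARK_HOME_VERSION = "1.6.0")  # assumption: matches the CDH Spark 1.6.0 parcel
sc <- spark_connect(
  master     = "yarn-client",
  config     = spark_config(),
  version    = "1.6.0",
  spark_home = "/opt/cloudera/parcels/CDH-5.12.1-1.cdh5.12.1.p0.3/lib/spark")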
When trying to connect to a Spark cluster using SparkR in RStudio:
if (nchar(Sys.getenv("SPARK_HOME")) < 1) {
Sys.setenv(SPARK_HOME = "/usr/lib/spark/spark-2.1.1-bin-hadoop2.6")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
}
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
# Starting a sparkR session
sparkR.session(master = "spark://myIpAddress.eu-west-1.compute.internal:7077")
I am getting the following error message:
Spark package found in SPARK_HOME: /usr/lib/spark/spark-2.1.1-bin-hadoop2.6
Launching java with spark-submit command /usr/lib/spark/spark-2.1.1-bin-hadoop2.6/bin/spark-submit sparkr-shell /tmp/RtmpMWFrt6/backend_port71e6731ea922
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/05/24 16:17:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/05/24 16:17:37 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Java ref type org.apache.spark.sql.SparkSession id 1
In the Spark master UI I can see the SparkR application running, but no sc variable is available. It feels like this error might be related to the metastore, but I am not sure. Does anyone know what prevents my Spark session from starting correctly?
Thanks, Michal
1- Removed the linked file using sudo rm -R /etc/spark/conf/hive-site.xml
2- Again linked the file using sudo ln -s /etc/hive/conf/hive-site.xml /etc/spark/conf/hive-site.xml
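If the warning persists after re-linking hive-site.xml, a hedged sketch for isolating the problem is to start the session without Hive support (enableHiveSupport is a standard sparkR.session argument), which takes the metastore out of the picture. Note also that sparkR.session() returns the SparkSession itself (the "Java ref type org.apache.spark.sql.SparkSession" line above) rather than creating an sc variable, so assign its return value if you need a handle:
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
# Keep the session handle; no sc variable is created automatically in SparkR 2.x
spark <- sparkR.session(
  master = "spark://myIpAddress.eu-west-1.compute.internal:7077",
  enableHiveSupport = FALSE)  # rules the Hive metastore in or out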
I have created an IBM BigInsights service with a Hadoop cluster of 5 nodes (including Apache Spark with SparkR). I am trying to use SparkR to connect to a Cloudant DB, get some data, and do some processing.
The SparkR job (an R script) submitted using spark-submit fails in the BigInsights Hadoop cluster.
I created the SparkR script and ran the following:
-bash-4.1$ spark-submit --master local[2] test_sparkr.R
16/08/07 17:43:40 WARN SparkConf: The configuration key 'spark.yarn.applicationMaster.waitTries' has been deprecated as of Spark 1.3 and and may be removed in the future. Please use the new key 'spark.yarn.am.waitTime' instead.
Error: could not find function "sparkR.init"
Execution halted
-bash-4.1$
Content of test_sparkr.R file is:
# Creating SparkContext and connecting to Cloudant DB
sc <- sparkR.init(sparkEnv = list("cloudant.host"="<<cloudant-host-name>>", "cloudant.username"="<<cloudant-user-name>>", "cloudant.password"="<<cloudant-password>>", "jsonstore.rdd.schemaSampleSize"="-1"))
# Database to be connected to extract the data
database <- "testdata"
# Creating Spark SQL Context
sqlContext <- sparkRSQL.init(sc)
# Creating DataFrame for the "testdata" Cloudant DB
testDataDF <- read.df(sqlContext, database, header='true', source = "com.cloudant.spark",inferSchema='true')
How do I install the spark-cloudant connector in IBM BigInsights and resolve this issue? Help would be much appreciated.
I believe that the spark-cloudant connector isn’t for R yet.
Hopefully I can update this answer when it is!
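Independently of the connector question, the "could not find function sparkR.init" error usually means the SparkR library itself was never loaded in the script. Below is a hedged sketch of the top of test_sparkr.R, assuming SparkR ships under $SPARK_HOME/R/lib on the BigInsights nodes (that path is an assumption); the connector jar would additionally have to reach the driver and executors, for example via spark-submit's --jars flag:
# Load the SparkR library shipped with Spark before calling sparkR.init
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sc <- sparkR.init(sparkEnvir = list(
  "cloudant.host" = "<<cloudant-host-name>>",
  "cloudant.username" = "<<cloudant-user-name>>",
  "cloudant.password" = "<<cloudant-password>>",
  "jsonstore.rdd.schemaSampleSize" = "-1"))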
I'm trying to connect to a cluster hosted on an EC2 machine from R and am getting the same error on both Windows and Mac:
> h2o.init(ip = "<Public IP>")
Connection successful!
ERROR: Unexpected HTTP Status code: 404 Not Found (url = http://<Public IP>:54321/3/Cloud?skip_ticks=true)
Error: lexical error: invalid char in json text.
<!DOCTYPE html> <html lang="en"
(right here) ------^
Cluster is reachable at http://<Public IP>:54321/
Starting a local cluster with h2o.init() also works fine in R, so the problem is only when trying to connect to remote one.
I've seen the following issue marked as resolved, but it doesn't help in my case. Has anybody experienced anything similar?
UPD: The answer was very simple. It turns out that the code example given in their guide for EC2 is outdated and uses the old version of H2O. Using the most recent version (3.9.1.5555 at the moment) on EC2 machines has resolved the issue.
To elaborate on the OP's update, when using a remote cluster:
Make sure you install the most recent version (check the S3 download page for the redirect to the release number). In the example below, this is 3.13.0.3908:
wget http://s3.amazonaws.com/h2o-release/h2o/master/3908/h2o-3.13.0.3908.zip
unzip h2o-3.13.0.3908.zip
mv h2o-3.13.0.3908 h2o
cd h2o
java -Xmx4g -jar h2o.jar
You then need to install the version of h2o-R that corresponds to this version. (The correct version is likely not the CRAN version.) Otherwise you will get an error like:
Error in h2o.init(ip = "XXX.XX.XX.XXX", startH2O = FALSE) :
Version mismatch! H2O is running version 3.13.0.3908 but h2o-R package is version 3.10.4.6.
Install the matching h2o-R version from - http://h2o-release.s3.amazonaws.com/h2o/master/3908/index.html
So you need to note the version number H2O is running (in the above example, 3908), make sure you have previously removed any existing h2o-R package (see here for more info), and then do:
install.packages("h2o", type="source", repos="http://h2o-release.s3.amazonaws.com/h2o/master/3908/R")
Now it should work:
library('h2o')
remoteH2O <- h2o.init(ip='XXX.XX.XX.XXX', startH2O=FALSE) # Connection successful!
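For completeness, a sketch (following the usual H2O-R install notes) of removing any previously installed h2o package before running the install.packages call above:
# Detach and remove any existing h2o R package so the matching version installs cleanly
if ("package:h2o" %in% search()) { detach("package:h2o", unload = TRUE) }
if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }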
The following issue occurs when trying to connect to Hive 2 (Kerberos authentication is enabled) using R's RJDBC package. I used the Simba driver to connect to Hive.
hiveConnection <- dbConnect(hiveJDBC, "jdbc:hive2://xxxx:10000/default;AuthMech=1;KrbRealm=xx.yy.com;KrbHostFQDN=dddd.yy.com;KrbServiceName=hive")
Error in .jcall(drv#jdrv, "Ljava/sql/Connection;", "connect", as.character(url)[1], :
java.sql.SQLException: [Simba]HiveJDBCDriver Invalid operation: Unable to obtain Principal Name for authentication ;
Make sure kinit has been issued and a Kerberos ticket has been generated (check with klist).
Make sure the right Java version for the given R version (32/64-bit) is available on the classpath.
Make sure the right slf4j jars are available for your Java version.
All these steps should resolve the issue, assuming your code does not have logic issues; a minimal connection sketch follows below.
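A minimal connection sketch pulling those points together; the driver jar directory and the driver class name (here the JDBC 4.1 variant) are assumptions that vary by Simba driver version:
library(RJDBC)
# 1. Get a Kerberos ticket outside R first:  kinit user@XX.YY.COM  ; verify with klist
# 2. Put every jar shipped with the Simba driver on the classpath, not just the main jar
jars <- list.files("/opt/simba/hivejdbc", pattern = "\\.jar$", full.names = TRUE)
hiveJDBC <- JDBC(driverClass = "com.simba.hive.jdbc41.HS2Driver",
                 classPath = paste(jars, collapse = .Platform$path.sep))
hiveConnection <- dbConnect(
  hiveJDBC,
  "jdbc:hive2://xxxx:10000/default;AuthMech=1;KrbRealm=xx.yy.com;KrbHostFQDN=dddd.yy.com;KrbServiceName=hive")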