How to make SparkR run - r

I have been trying to make SparkR work without success.
I have read previous questions and blog posts, yet I haven't been able to make it work.
First I had issues installing SparkR; I think I finally got it installed, but then I cannot make it run.
Here is my detailed code with the different options I tried to make it run.
I am currently using RStudio with R 3.6.0.
Any help will be appreciated!
#***************************#
#Installing Spark Option 1
#***************************#
install.packages("SparkR")
# Does not work
Sys.setenv("JAVA_HOME" = "D:/Program Files/Java/jdk1.8.0_181")
Sys.getenv("JAVA_HOME")
#***************************#
#Installing Spark Option 2
#***************************#
#Find Spark Versions
jsonlite::fromJSON("https://api.github.com/repos/apache/spark/tags")$name
if (!require('devtools')) install.packages('devtools')
# Note: '#v2.4.6' refers to a pull request in install_github(); a release tag would be '@v2.4.6'
devtools::install_github('apache/spark#v2.4.6', subdir = 'R/pkg')
Sys.setenv(SPARK_HOME='D:/spark-2.3.1-bin-hadoop2.7')
.libPaths(c(file.path(Sys.getenv('SPARK_HOME'), 'R', 'lib'), .libPaths()))
# Installation didn't work
#***************************#
#Installation Spark Option 3
#***************************#
install.packages("sparklyr")
library(sparklyr)
spark_install(version = "2.3.1")
install.packages("https://cran.r-project.org/src/contrib/Archive/SparkR/SparkR_2.3.0.tar.gz", repos = NULL, type="source")
library(SparkR)
# One of the two installations worked
#***************************#
#Starting Spark Option 1
#***************************#
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R","lib")))
sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g"))
# Console output:
# Spark package found in SPARK_HOME: D:/spark-2.3.1-bin-hadoop2.7
# Launching java with spark-submit command D:/spark-2.3.1-bin-hadoop2.7/bin/spark-submit2.cmd --driver-memory "2g" sparkr-shell C:\Users\FELIPE~1\AppData\Local\Temp\RtmpKOxYkx\backend_port34a0263f43f5
# Error in if (len > 0) { : argument is of length zero
#***************************#
#Starting Spark Option 2
#***************************#
Sys.setenv("JAVA_HOME" = "D:/Program Files/Java/jdk1.8.0_181")
Sys.getenv("JAVA_HOME")
sparkEnvir <- list(spark.num.executors='5', spark.executor.cores='5')
# initializing Spark context
# Note: the extra inner quotes make sparkHome an invalid path
sc <- sparkR.init(sparkHome = "'D:/spark-2.3.1-bin-hadoop2.7'",
                  sparkEnvir = sparkEnvir)
# Console output:
# Error in sparkR.sparkContext(master, appName, sparkHome, convertNamedListToEnv(sparkEnvir), :
#   JVM is not ready after 10 seconds
# In addition: Warning message:
# sparkR.init is deprecated.
# Use sparkR.session instead.
# See help("Deprecated")
#***************************#
#Starting Spark Option 3
#***************************#
Sys.setenv("JAVA_HOME" = "D:/Program Files/Java/jdk1.8.0_181")
Sys.getenv("JAVA_HOME")
sparkEnvir <- list(spark.num.executors='5', spark.executor.cores='5')
# initializing Spark context
# Note: same nested-quote issue in sparkHome; also, sparkR.session() takes sparkConfig rather than sparkEnvir
sc <- sparkR.session(sparkHome = "'D:/spark-2.3.1-bin-hadoop2.7'",
                     sparkEnvir = sparkEnvir)
# Console output:
# Spark not found in SPARK_HOME: D:/spark-2.3.1-bin-hadoop2.7
# Spark package found in SPARK_HOME: D:/spark-2.3.1-bin-hadoop2.7
# Launching java with spark-submit command D:/spark-2.3.1-bin-hadoop2.7/bin/spark-submit2.cmd sparkr-shell C:\Users\FELIPE~1\AppData\Local\Temp\RtmpKOxYkx\backend_port34a082b15e1
# Error in if (len > 0) { : argument is of length zero
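For reference, this is the minimal startup sequence I am aiming for, pieced together from the SparkR documentation (a sketch, not something that currently works for me; the JDK and Spark paths are the same local ones used above):
#***************************#
#Minimal target sequence
#***************************#
Sys.setenv(JAVA_HOME = "D:/Program Files/Java/jdk1.8.0_181")
Sys.setenv(SPARK_HOME = "D:/spark-2.3.1-bin-hadoop2.7")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g"))
# Quick sanity check that the session responds
df <- as.DataFrame(faithful)
head(df)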

Related

How to solve ".jcall(conn#jc, "Ljava/sql/Statement;", "createStatement") : " error in R

I run an R script like this:
library(rugarch)
library(forecast)
library(blah blah blah)
set.seed(313)
config <- config::get(file = "/home/config.yml", use_parent = FALSE)
jdbcDriver = JDBC("oracle.jdbc.OracleDriver", classPath="ojdbc6.jar")
jdbcConnection = dbConnect(jdbcDriver, config$url, config$dbuser, config$dbpass)
I get the error below:
Error in .jcall(conn#jc, "Ljava/sql/Statement;", "createStatement") :
RcallMethod: attempt to call a method of a NULL object.
I use:
R version 3.6.3 (2020-02-29)
Linux "Ubuntu" VERSION="20.04 LTS (Focal Fossa)"
I run R inside a Docker container.
How can I solve the problem?
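A common cause of this NULL-object error is that the driver jar is not actually found at classPath, so the driver object ends up empty. A minimal check, shown here as a sketch (the absolute jar path is an example, not the real location inside the container):
library(RJDBC)   # provides JDBC(); dbConnect() comes from DBI
library(DBI)
driver_jar <- "/opt/oracle/ojdbc6.jar"   # example path -- adjust to your container layout
stopifnot(file.exists(driver_jar))       # fail early if the jar is not where we expect
jdbcDriver <- JDBC("oracle.jdbc.OracleDriver", classPath = driver_jar)
jdbcConnection <- dbConnect(jdbcDriver, config$url, config$dbuser, config$dbpass)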

Trouble getting H2O to work with Sparklyr

I am trying to get H2O working with sparklyr on my Spark cluster (YARN).
My Spark cluster is running v2.4.4 (spark_version(sc) returns 2.4.4).
According to this page, the Sparkling Water version compatible with my Spark is 2.4.5, and the matching H2O release is rel-xu patch version 3. However, when I install this version I am prompted to update my H2O install to the next release (rel-zorn). Between the H2O guides and the sparklyr guides it is very confusing and at times contradictory.
Since this is a YARN deployment and not local, I unfortunately cannot provide a reprex to help with troubleshooting.
url <- "http://h2o-release.s3.amazonaws.com/sparkling-water/rel-2.4/5/sparkling-water-2.4.5.zip"
download.file(url = url,"sparkling-water-2.4.5.zip")
unzip("sparkling-water-2.4.5.zip")
# RUN THESE CMDs FROM THE TERMINAL
cd sparkling-water-2.4.5
bin/sparkling-shell --conf "spark.executor.memory=1g"
# RUN THESE FROM WITHIN RSTUDIO
install.packages("sparklyr")
library(sparklyr)
# REMOVE PRIOR INSTALLS OF H2O
detach("package:rsparkling", unload = TRUE)
if ("package:h2o" %in% search()) { detach("package:h2o", unload = TRUE) }
if (isNamespaceLoaded("h2o")){ unloadNamespace("h2o") }
remove.packages("h2o")
# INSTALLING REL-ZORN (3.36.0.3) WHICH IS REQUIRED FOR SPARKLING WATER 3.36.0.3
install.packages("h2o", type = "source", repos = "https://h2o-release.s3.amazonaws.com/h2o/rel-zorn/3/R")
# INSTALLING FROM S3 SINCE CRAN NO LONGER SUPPORTED
install.packages("rsparkling", type = "source", repos = "http://h2o-release.s3.amazonaws.com/sparkling-water/spark-2.4/3.36.0.3-1-2.4/R")
# AS PER THE GUIDE
options(rsparkling.sparklingwater.version = "2.4.5")
library(rsparkling)
# SPECIFY THE CONFIGURATION
config <- sparklyr::spark_config()
config[["spark.yarn.queue"]] <- "my_data_science_queue"
config[["sparklyr.backend.timeout"]] <- 36000
config[["spark.executor.cores"]] <- 32
config[["spark.driver.cores"]] <- 32
config[["spark.executor.memory"]] <- "40g"
config[["spark.executor.instances"]] <- 8
config[["sparklyr.shell.driver-memory"]] <- "16g"
config[["spark.default.parallelism"]] <- "8"
config[["spark.rpc.message.maxSize"]] <- "256"
# MAKE A SPARK CONNECTION
sc <- sparklyr::spark_connect(
  master = "yarn",
  spark_home = "/opt/mapr/spark/spark",
  config = config,
  log = "console",
  version = "2.4.4"
)
When I try to establish an H2O context using the next chunk, I get the following error:
h2o_context(sc)
Error in h2o_context(sc) : could not find function "h2o_context"
Any pointers as to where I'm going wrong would be greatly appreciated.
Please see this tutorial. Newer versions of rsparkling use H2OContext.getOrCreate(h2oConf) instead of h2o_context(sc).
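A minimal sketch of that newer API, assuming the rsparkling 3.36.x install from the question (H2OConf() and H2OContext.getOrCreate() follow the linked tutorial; the exact calls may differ between versions):
library(sparklyr)
library(rsparkling)
sc <- spark_connect(master = "yarn", config = config, version = "2.4.4")
h2oConf <- H2OConf()                    # assumed configuration helper from the tutorial
hc <- H2OContext.getOrCreate(h2oConf)   # replaces the deprecated h2o_context(sc)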

Setting up sparklyr

I am working on setting up sparklyr using R, but I keep getting an error message. I essentially have this typed in:
install.packages("sparklyr")
library(sparklyr)
spark_install(version = "2.1.0")
sc <- spark_connect(master = "local")
However, when I get to creating my Spark connection, I receive the following error message:
Using Spark: 2.1.0
Error in if (a[k] > b[k]) return(1) else if (a[k] < b[k]) return(-1L) :
missing value where TRUE/FALSE needed
In addition: Warning messages:
1: running command '"C:\WINDOWS\SYSTEM32\java.exe" -version' had status 2
2: In compareVersion(parsedVersion, "1.7") : NAs introduced by coercion
Any thoughts?
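The warnings suggest that sparklyr could not run java -version (status 2) and then failed to parse the version string, so R is probably not seeing a usable Java install. A quick check from R, shown as a sketch (the JDK path is an example for a typical Windows install):
# Confirm which Java the R session actually sees; status 2 above means this call failed
system2("java", "-version")
# Point JAVA_HOME at a JDK 8 install before connecting (example path)
Sys.setenv(JAVA_HOME = "C:/Program Files/Java/jdk1.8.0_181")
library(sparklyr)
sc <- spark_connect(master = "local", version = "2.1.0")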

Not able to convert R data frame to Spark DataFrame

When I try to convert my local data frame in R to a Spark DataFrame using:
raw.data <- as.DataFrame(sc, raw.data)
I get this error:
17/01/24 08:02:04 WARN RBackendHandler: cannot find matching method class org.apache.spark.sql.api.r.SQLUtils.getJavaSparkContext. Candidates are:
17/01/24 08:02:04 WARN RBackendHandler: getJavaSparkContext(class org.apache.spark.sql.SQLContext)
17/01/24 08:02:04 ERROR RBackendHandler: getJavaSparkContext on org.apache.spark.sql.api.r.SQLUtils failed
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
This question is similar to sparkR on AWS: Unable to load native-hadoop library.
You don't need to use sc if you are using the latest version of Spark. I am using the SparkR package, version 2.0.0, in RStudio. Please go through the following code (used to connect an R session to a SparkR session):
if (nchar(Sys.getenv("SPARK_HOME")) < 1) {
Sys.setenv(SPARK_HOME = "path-to-spark home/spark-2.0.0-bin-hadoop2.7")
}
library(SparkR)
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R","lib")))
sparkR.session(enableHiveSupport = FALSE,master = "spark://master url:7077", sparkConfig = list(spark.driver.memory = "2g"))
The following is the output from the R console:
> data<-as.data.frame(iris)
> class(data)
[1] "data.frame"
> data.df<-as.DataFrame(data)
> class(data.df)
[1] "SparkDataFrame"
attr(,"package")
[1] "SparkR"
Use this example code:
library(SparkR)
library(readr)
sc <- sparkR.init(appName = "data")
sqlContext <- sparkRSQL.init(sc)
old_df<-read_csv("/home/mx/data.csv")
old_df<-data.frame(old_df)
new_df <- createDataFrame(sqlContext, old_df)

How to initialize a new Spark Context and executors number on YARN from RStudio

I am working with SparkR.
I am able to set a Spark context on YARN with the desired number of executors and executor cores with this command:
spark/bin/sparkR --master yarn-client --num-executors 5 --executor-cores 5
Now I am trying to initialize a new Spark context from RStudio, which is more comfortable to work with than the regular command line.
I figured out that to do this I need to use the sparkR.init() function. There is a master option, which I set to yarn-client, but how do I specify num-executors or executor-cores? This is where I am stuck:
library(SparkR, lib.loc = "spark-1.5.0-bin-hadoop2.4/R/lib")
sc <- sparkR.init(sparkHome = "spark-1.5.0-bin-hadoop2.4/",
                  master = "yarn-client")
Providing the sparkEnvir argument to sparkR.init should work:
sparkEnvir <- list(spark.num.executors='5', spark.executor.cores='5')
sc <- sparkR.init(
  sparkHome = "spark-1.5.0-bin-hadoop2.4/",
  master = "yarn-client",
  sparkEnvir = sparkEnvir)
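For newer Spark versions, where sparkR.init is deprecated, the rough sparkR.session equivalent would look like this (a sketch; spark.executor.instances and spark.executor.cores are standard Spark properties passed through sparkConfig):
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
# Rough sparkR.session() equivalent of the sparkR.init() call above
sparkR.session(
  master = "yarn",
  sparkConfig = list(
    spark.executor.instances = "5",   # executor count on YARN
    spark.executor.cores = "5"
  ))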
