Trouble getting H2O to work with sparklyr

I am trying to get H2O working with sparklyr on my Spark cluster (YARN).
spark_version(sc) = 2.4.4
My Spark cluster is running v2.4.4.
According to this page, the Sparkling Water version compatible with my Spark is 2.4.5, and the matching H2O release is rel-xu patch version 3. However, when I install this version I am prompted to update my H2O install to the next release (rel-zorn). Between the H2O guides and the sparklyr guides it's very confusing and at times contradictory.
Since this is a YARN deployment and not local, unfortunately I can't provide a reprex to help with troubleshooting.
url <- "http://h2o-release.s3.amazonaws.com/sparkling-water/rel-2.4/5/sparkling-water-2.4.5.zip"
download.file(url = url,"sparkling-water-2.4.5.zip")
unzip("sparkling-water-2.4.5.zip")
# RUN THESE CMDs FROM THE TERMINAL
cd sparkling-water-2.4.5
bin/sparkling-shell --conf "spark.executor.memory=1g"
# RUN THESE FROM WITHIN RSTUDIO
install.packages("sparklyr")
library(sparklyr)
# REMOVE PRIOR INSTALLS OF H2O
detach("package:rsparkling", unload = TRUE)
if ("package:h2o" %in% search()) { detach("package:h2o", unload = TRUE) }
if (isNamespaceLoaded("h2o")){ unloadNamespace("h2o") }
remove.packages("h2o")
# INSTALLING REL-ZORN (3.36.0.3) WHICH IS REQUIRED FOR SPARKLING WATER 3.36.0.3
install.packages("h2o", type = "source", repos = "https://h2o-release.s3.amazonaws.com/h2o/rel-zorn/3/R")
# INSTALLING FROM S3 SINCE CRAN NO LONGER SUPPORTED
install.packages("rsparkling", type = "source", repos = "http://h2o-release.s3.amazonaws.com/sparkling-water/spark-2.4/3.36.0.3-1-2.4/R")
# AS PER THE GUIDE
options(rsparkling.sparklingwater.version = "2.4.5")
library(rsparkling)
# SPECIFY THE CONFIGURATION
config <- sparklyr::spark_config()
config[["spark.yarn.queue"]] <- "my_data_science_queue"
config[["sparklyr.backend.timeout"]] <- 36000
config[["spark.executor.cores"]] <- 32
config[["spark.driver.cores"]] <- 32
config[["spark.executor.memory"]] <- "40g"
config[["spark.executor.instances"]] <- 8
config[["sparklyr.shell.driver-memory"]] <- "16g"
config[["spark.default.parallelism"]] <- "8"
config[["spark.rpc.message.maxSize"]] <- "256"
# MAKE A SPARK CONNECTION
sc <- sparklyr::spark_connect(
  master = "yarn",
  spark_home = "/opt/mapr/spark/spark",
  config = config,
  log = "console",
  version = "2.4.4"
)
When I try to establish an H2O context using the next chunk, I get the following error:
h2o_context(sc)
Error in h2o_context(sc) : could not find function "h2o_context"
Any pointers as to where I'm going wrong would be greatly appreciated.

Please see this tutorial. Newer versions of rsparkling use H2OContext.getOrCreate(h2oConf) instead of h2o_context(sc).
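For example, a minimal sketch of the newer API (my addition, not from the original answer; depending on the Sparkling Water release, H2OContext.getOrCreate() may be called with no arguments or with an H2OConf object, as in the linked tutorial):
library(sparklyr)
library(rsparkling)
# config built as in the question (spark_config() plus the YARN settings)
config <- sparklyr::spark_config()
sc <- sparklyr::spark_connect(master = "yarn", config = config, version = "2.4.4")
# Newer rsparkling releases replace h2o_context(sc) with H2OContext.getOrCreate()
hc <- H2OContext.getOrCreate()
hc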

Related

"It is a distutils installed project ..." when calling install_mlflow()

When install_mlflow() is called to install mlflow for R, the following error is encountered.
Attempting uninstall: certifi
Found existing installation: certifi 2018.4.16
ERROR: Cannot uninstall 'certifi'. It is a distutils installed project and thus we cannot accurately determine which files belong to it which would lead to only a partial uninstall.
Note: the above uses miniconda installed with the install_miniconda() command.
P.S. Posting the question & answer for everyone's benefit (I spent 2 days on this).
Root cause:
The function install_mlflow() calls reticulate::conda_install() with the default value of pip_ignore_installed, which turns out to be FALSE.
💡 Hint: click any function in the script while holding down the Cmd key to view its source code.
Workaround:
You can work around this issue by calling the function with pip_ignore_installed = TRUE. I've recreated the install_mlflow() function in the script below for convenience.
The script also checks for miniconda and installs it if it is not already installed.
library(reticulate)
library(mlflow)
# Installing miniconda if not installed
if (!dir.exists(miniconda_path()))
  install_miniconda(path = miniconda_path(), update = TRUE, force = TRUE)
# install_mlflow() # This doesn't work so we use the alt fn below.
install_mlflow_alt <- function() {
  mlflow_version <- utils::packageVersion("mlflow")
  packages <- c(paste("mlflow", "==", mlflow_version, sep = ""))
  # Getting the mlflow conda bin
  conda_home <- Sys.getenv("MLFLOW_CONDA_HOME", NA)
  conda <- if (!is.na(conda_home)) {
    paste(conda_home, "bin", "conda", sep = "/")
  } else {
    "auto"
  }
  conda_try <- try(conda_binary(conda = conda), silent = TRUE)
  if (inherits(conda_try, "try-error")) {
    msg <- paste(attributes(conda_try)$condition$message,
                 paste(" If you are not using conda, you can set the environment variable",
                       "MLFLOW_PYTHON_BIN to the path of your python executable."),
                 sep = "\n")
    stop(msg)
  }
  conda <- conda_try
  # Installing mlflow with pip_ignore_installed = TRUE to avoid the certifi error
  mlflow_conda_env_name <- paste("r-mlflow", mlflow_version, sep = "-")
  conda_install(packages, envname = mlflow_conda_env_name,
                pip = TRUE, conda = conda, pip_ignore_installed = TRUE)
}
# NOTE: Run the following command in terminal (use pip3 for python 3)
# before calling the install_mlflow_alt() function below
# paste("pip install -U mlflow==", mlflow:::mlflow_version(), sep="")
install_mlflow_alt()
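As a quick sanity check (my addition, not part of the original answer), you can list the conda environments reticulate knows about and confirm that the r-mlflow-<version> environment was created:
# The freshly created r-mlflow-<version> environment should appear in the output
reticulate::conda_list()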

How old is an installed R package?

Is it possible to get the year an installed R package was released using some R code? I can get the version, but then I have to look up on the internet when that version was released.
Background: I am working for the Swiss Federal Statistical Office, and a small group is trying to get a better R environment (we are, for example, working with dplyr version 0.7.4 from 2017... and it is not possible to install a newer version...).
Cheers
Renger
You can use the versions package to get a timestamp for a package version. The package pulls the published versions of the package from the MRAN snapshot server.
versions::installed.versions("dplyr")
# [1] "1.0.7"
versions::available.versions("dplyr")
# $dplyr
#   version       date available
# 1   1.0.7 2021-06-18      TRUE
# 2   1.0.6 2021-05-05      TRUE
# 3   1.0.5 2021-03-05      TRUE
# ...
Package age
So if you want to answer the specific question about a package's age, you can do the following:
how_old <- function(pkg, lib = .libPaths()[1], return_age = FALSE) {
  pkg_ver <- versions::installed.versions(pkgs = pkg, lib = lib)
  av_vers <- versions::available.versions(pkgs = pkg)
  pkg_dte <- subset.data.frame(
    x = as.data.frame(unname(av_vers)),
    subset = version == pkg_ver,
    select = date,
    drop = TRUE
  )
  pkg_dte <- as.Date(pkg_dte)
  if (return_age) {
    return(epocakir::dob2age(dob = pkg_dte))
  } else {
    return(pkg_dte)
  }
}
how_old("dplyr", return_age = TRUE)
Results
[1] "1123200s (~1.86 weeks)"
Package creation
Or if you want to find out when a package was installed locally:
when_created <- function(pkg, lib = .libPaths()[1]) {
  # A package will always have a DESCRIPTION file, so that's a safe bet
  desc_file <- system.file("DESCRIPTION", package = pkg, lib.loc = lib)
  info <- fs::file_info(desc_file)
  info$birth_time
}
when_created("dplyr")
Results
# [1] "2021-06-25 08:47:21 BST"
As @Jonathan recommended, if the package has a citation, you can pull the year from the citation.
citation("dplyr")$year
An alternative is to get the date from a list of available versions of a package.
devtools::install_github("cran/versions")
library(versions)
x <- versions::available.versions(c("dplyr", "ggplot2"))
version_year <- function(x, package.name = "", version = "") {
  pckg <- x[[package.name]]
  row <- which(pckg$version == version)
  return(pckg$date[row])
}
version_year(x, "ggplot2", version = "2.0.0")
#[1] "2015-12-18"
As a last resort, you can find out when a package was created from its DESCRIPTION:
packageDescription(pkg)$Packaged
In fact, citation falls back to this very field if no other date was given (either as Date/Publication or via an explicit CITATION file).
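As a minimal sketch (my addition, not from the original answers): the Packaged field is typically something like "2021-06-18 09:05:12 UTC; builder", so the date part can be pulled out roughly like this:
# Hypothetical helper: parse the build date out of the Packaged field
packaged_date <- function(pkg) {
  packaged <- packageDescription(pkg)$Packaged
  if (is.null(packaged)) return(NA)
  # Drop the "; builder" suffix and keep only the leading timestamp
  as.Date(sub(";.*$", "", packaged))
}
packaged_date("dplyr")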

How to make sparkR run

I have been trying to make SparkR work, without success.
I have read previous questions and blog posts, and yet I haven't been able to make it work.
First I had issues installing SparkR; I finally think I installed it, but then I cannot make it run.
Here is my detailed code with the different options I tried to make it run.
I am currently using RStudio with R version 3.6.0.
Any help will be appreciated!!
#***************************#
#Installing Spark Option 1
#***************************#
install.packages("SparkR")
'''
Does not work
'''
Sys.setenv("JAVA_HOME" = "D:/Program Files/Java/jdk1.8.0_181")
Sys.getenv("JAVA_HOME")
#***************************#
#Installing Spark Option 2
#***************************#
#Find Spark Versions
jsonlite::fromJSON("https://api.github.com/repos/apache/spark/tags")$name
if (!require('devtools')) install.packages('devtools')
devtools::install_github('apache/spark#v2.4.6', subdir='R/pkg')
Sys.setenv(SPARK_HOME='D:/spark-2.3.1-bin-hadoop2.7')
.libPaths(c(file.path(Sys.getenv('SPARK_HOME'), 'R', 'lib'), .libPaths()))
'''
Installation didnt work
'''
#***************************#
#Installation Spark Option 3
#***************************#
install.packages("sparklyr")
library(sparklyr)
spark_install(version = "2.3.1")
install.packages("https://cran.r-project.org/src/contrib/Archive/SparkR/SparkR_2.3.0.tar.gz", repos = NULL, type="source")
library(SparkR)
'''
One of 2 installations worked
'''
#***************************#
#Starting Spark Option 1
#***************************#
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R","lib")))
sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g"))
'''
Spark package found in SPARK_HOME: D:/spark-2.3.1-bin-hadoop2.7
Launching java with spark-submit command D:/spark-2.3.1-bin-hadoop2.7/bin/spark-submit2.cmd --driver-memory "2g" sparkr-shell C:\Users\FELIPE~1\AppData\Local\Temp\RtmpKOxYkx\backend_port34a0263f43f5
Error in if (len > 0) { : argument has length zero
'''
#***************************#
#Starting Spark Option 2
#***************************#
Sys.setenv("JAVA_HOME" = "D:/Program Files/Java/jdk1.8.0_181")
Sys.getenv("JAVA_HOME")
sparkEnvir <- list(spark.num.executors='5', spark.executor.cores='5')
# initializing Spark context
sc <- sparkR.init(sparkHome = "'D:/spark-2.3.1-bin-hadoop2.7'",
                  sparkEnvir = sparkEnvir)
'''
Error in sparkR.sparkContext(master, appName, sparkHome, convertNamedListToEnv(sparkEnvir), :
JVM is not ready after 10 seconds
In addition: Warning message:
sparkR.init is deprecated.
Use sparkR.session instead.
See help("Deprecated")
'''
#***************************#
#Starting Spark Option 3
#***************************#
Sys.setenv("JAVA_HOME" = "D:/Program Files/Java/jdk1.8.0_181")
Sys.getenv("JAVA_HOME")
sparkEnvir <- list(spark.num.executors='5', spark.executor.cores='5')
# initializing Spark context
sc <- sparkR.session(sparkHome = "'D:/spark-2.3.1-bin-hadoop2.7'",
                     sparkEnvir = sparkEnvir)
'''
Spark not found in SPARK_HOME: D:/spark-2.3.1-bin-hadoop2.7
Spark package found in SPARK_HOME: D:/spark-2.3.1-bin-hadoop2.7
Launching java with spark-submit command D:/spark-2.3.1-bin-hadoop2.7/bin/spark-submit2.cmd sparkr-shell C:\Users\FELIPE~1\AppData\Local\Temp\RtmpKOxYkx\backend_port34a082b15e1
Error in if (len > 0) { : argument has length zero
'''

How to install a package from GitHub?

I need to install a GitHub package.
This is my code:
devtools::install_github("jgalgarra/kcorebip")
I already have devtools installed, but it gives me the following error:
Error: Failed to install 'unknown package' from GitHub:
Timeout was reached: Connection timed out after 10015 milliseconds
This is what I have configured in my Rprofile.site:
# Things you might want to change
# options(papersize = "a4")
# options(editor = "notepad")
# options(pager = "internal")
# set the default help type
# options(help_type = "text")
options(help_type = "html")
# set a site library
# .Library.site <- file.path(chartr("\\", "/", R.home()), "site-library")
# set a CRAN mirror
# local({r <- getOption("repos")
#   r["CRAN"] <- "http://my.local.cran"
#   r["CRAN"] <- "http://cran.us.r-project.org"
#   options(repos = r)})
local({
  r <- getOption("repos")
  r["Nexus"] <- "http://nexus.uo.edu.cu:8081/repository/R-repository/"
  options(repos = r)
})
# Give a fortune cookie, but only to interactive sessions
# (This would need the fortunes package to be installed.)
# if (interactive())
#   fortunes::fortune()
Can I modify any line to make the GitHub install work, or is it a different problem?

How to initialize a new Spark Context and executors number on YARN from RStudio

I am working with SparkR.
I am able to set up a Spark context on YARN with the desired number of executors and executor cores with a command like this:
spark/bin/sparkR --master yarn-client --num-executors 5 --executor-cores 5
Now I am trying to initialize a new Spark Context but from RStudio which is more comfortable to work with than a regular command line.
I figured out that to do this I need to use the sparkR.init() function. There is a master option, which I set to yarn-client, but how do I specify num-executors or executor-cores? This is where I got stuck:
library(SparkR, lib.loc = "spark-1.5.0-bin-hadoop2.4/R/lib")
sc <- sparkR.init(sparkHome = "spark-1.5.0-bin-hadoop2.4/",
                  master = "yarn-client")
Providing the sparkEnvir argument to sparkR.init should work:
sparkEnvir <- list(spark.num.executors = '5', spark.executor.cores = '5')
sc <- sparkR.init(
  sparkHome = "spark-1.5.0-bin-hadoop2.4/",
  master = "yarn-client",
  sparkEnvir = sparkEnvir)
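For newer Spark releases, where sparkR.init is deprecated in favour of sparkR.session (as the warning in the previous question shows), a roughly equivalent sketch would be the following; the property names spark.executor.instances and spark.executor.cores are taken from the Spark-on-YARN configuration documentation rather than from the original answer:
# Sketch for Spark 2.x and later, launched from RStudio
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))
sparkR.session(
  master = "yarn",
  sparkConfig = list(
    spark.executor.instances = "5",
    spark.executor.cores = "5"
  )
)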
