How to use Jupyter + SparkR and custom R install

I am using a Dockerized image and Jupyter notebook along with a SparkR kernel. When I create a SparkR notebook, it uses an install of Microsoft R (3.3.2) instead of the vanilla CRAN R install (3.2.3).
The Docker image I'm using installs some custom R libraries and Python packages, but I don't explicitly install Microsoft R. Regardless of whether or not I can remove Microsoft R or have it side-by-side, how can I get my SparkR kernel to use a custom installation of R?
Thanks in advance

Docker-related issues aside, the settings for Jupyter kernels are configured in files named kernel.json, residing in specific directories (one per kernel) which can be seen using the command jupyter kernelspec list; for example, here is the case in my (Linux) machine:
$ jupyter kernelspec list
Available kernels:
  python2       /usr/lib/python2.7/site-packages/ipykernel/resources
  caffe         /usr/local/share/jupyter/kernels/caffe
  ir            /usr/local/share/jupyter/kernels/ir
  pyspark       /usr/local/share/jupyter/kernels/pyspark
  pyspark2      /usr/local/share/jupyter/kernels/pyspark2
  tensorflow    /usr/local/share/jupyter/kernels/tensorflow
Again, as an example, here are the contents of the kernel.json for my R kernel (ir):
{
  "argv": ["/usr/lib64/R/bin/R", "--slave", "-e", "IRkernel::main()", "--args", "{connection_file}"],
  "display_name": "R 3.3.2",
  "language": "R"
}
And here is the respective file for my pyspark2 kernel:
{
  "display_name": "PySpark (Spark 2.0)",
  "language": "python",
  "argv": [
    "/opt/intel/intelpython27/bin/python2",
    "-m",
    "ipykernel",
    "-f",
    "{connection_file}"
  ],
  "env": {
    "SPARK_HOME": "/home/ctsats/spark-2.0.0-bin-hadoop2.6",
    "PYTHONPATH": "/home/ctsats/spark-2.0.0-bin-hadoop2.6/python:/home/ctsats/spark-2.0.0-bin-hadoop2.6/python/lib/py4j-0.10.1-src.zip",
    "PYTHONSTARTUP": "/home/ctsats/spark-2.0.0-bin-hadoop2.6/python/pyspark/shell.py",
    "PYSPARK_PYTHON": "/opt/intel/intelpython27/bin/python2"
  }
}
As you can see, in both cases the first element of argv is the executable for the respective language - in my case, GNU R for my ir kernel and Intel Python 2.7 for my pyspark2 kernel. Changing this, so that it points to your GNU R executable, should resolve your issue.
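If you'd rather script that edit, here is a minimal sketch using the jsonlite package (the sparkr kernelspec path is hypothetical; substitute the directory reported by jupyter kernelspec list and the R executable you actually want):
library(jsonlite)

# Hypothetical paths - adjust both to your system.
spec_path <- "/usr/local/share/jupyter/kernels/sparkr/kernel.json"
custom_r  <- "/usr/lib64/R/bin/R"

spec <- fromJSON(spec_path, simplifyVector = FALSE)
spec$argv[[1]] <- custom_r   # the first element of argv is the interpreter
writeLines(toJSON(spec, auto_unbox = TRUE, pretty = TRUE), spec_path)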

To use a custom R environment I believe you need to set the following application properties when you start Spark:
"spark.r.command": "/custom/path/bin/R",
"spark.r.driver.command": "/custom/path/bin/Rscript",
"spark.r.shell.command" : "/custom/path/bin/R"
This is more completely documented here: https://spark.apache.org/docs/latest/configuration.html#sparkr
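If you start Spark from within R, one way to pass these properties (a sketch, assuming SparkR 2.x is on your library path and reusing the /custom/path placeholder from above) is the sparkConfig argument of sparkR.session():
library(SparkR)

# Placeholder paths - point these at the R installation Spark should use.
sparkR.session(sparkConfig = list(
  spark.r.command        = "/custom/path/bin/Rscript",  # R scripts in cluster modes
  spark.r.driver.command = "/custom/path/bin/Rscript",  # driver R scripts in client modes
  spark.r.shell.command  = "/custom/path/bin/R"         # the sparkR shell itself
))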

Related

Change Default kernel in jupyter notebook

I am using IPython 6.4.0 on Ubuntu 20.04, and using jupyter kernelspec list I found there are 2 kernels:
practice_applied_ai
python3
When I open any .ipynb file, it opens directly in "python3", but I want to open it in "practice_applied_ai", because I created the virtual environment practice_applied_ai and only in this kernel can I import TensorFlow 2.2.0 for my work.
My question is: is there any way to change my default kernel without removing any kernel?
jupyter notebook --generate-config
Then open the generated config file and change this line to your desired kernel:
#c.MultiKernelManager.default_kernel_name = 'python3'
for example:
c.MultiKernelManager.default_kernel_name = 'py38'
See this answer on GitHub.
As explained there:
the default kernel name is rarely used. It really only comes into play when a request is received to start a kernel and the kernel name is not specified in the request payload. Since both Notebook and Lab UIs essentially require the user to select a kernel (for new notebooks), it doesn't really come into play.
Put c.MappingKernelManager.default_kernel_name='newDefault' in the config file.
To confirm the default is in place, hit the kernelspecs REST API of your running notebook server (e.g., http://localhost:8888/api/kernelspecs) and you should see the default kernel name as the first entry in the returned payload.
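If you prefer to check from R rather than a browser, a minimal sketch (assuming the httr package and a notebook server on localhost:8888; an auth token may be required depending on your configuration):
# Query the running server's kernelspecs endpoint.
resp <- httr::GET("http://localhost:8888/api/kernelspecs")
httr::content(resp)$default   # the configured default kernel name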
Yes, this is possible via the .ipynb file itself.
Set the following variables in the metadata, specifically the name, which identifies the kernel:
"metadata": {
"kernelspec": {
"display_name": "Python 3 (PyTorch 1.6 Python 3.6 CPU Optimized)",
"language": "python",
"name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/pytorch-1.6-cpu-py36-ubuntu16.04-v1"
},
"language_info": {
"codemirror_mode": {
Not sure how to change the default kernel, but you can change the kernel being used in a notebook after opening it, as explained on this site.
Open the notebook. Then navigate to Kernel -> Change Kernel and select the kernel you want to use.

mlflow R installation MLFLOW_PYTHON_BIN

I am trying to install mlflow in R and I'm getting this error message:
mlflow::install_mlflow()
Error in mlflow_conda_bin() :
Unable to find conda binary. Is Anaconda installed?
If you are not using conda, you can set the environment variable MLFLOW_PYTHON_BIN to the path of your python executable.
I have tried the following:
export MLFLOW_PYTHON_BIN="/usr/bin/python"
source ~/.bashrc
echo $MLFLOW_PYTHON_BIN   # prints /usr/bin/python
Or, in R:
Sys.setenv(MLFLOW_PYTHON_BIN = "/usr/bin/python")
Sys.getenv()   # shows MLFLOW_PYTHON_BIN set to /usr/bin/python
However, it still does not work. I do not want to use a conda environment.
How do I get past this error?
The install_mlflow command only works with conda right now, sorry about the confusing message. You can either:
install conda - this is the recommended way of installing and using mlflow
or
install the mlflow python package yourself via pip
To install mlflow yourself, pip install the python version of mlflow that matches the R package, and set the MLFLOW_PYTHON_BIN environment variable as well as the MLFLOW_BIN env variable, e.g.:
library(mlflow)
system(paste("pip install -U mlflow==", mlflow:::mlflow_version(), sep=""))
Sys.setenv(MLFLOW_BIN=system("which mlflow"))
Sys.setenv(MLFLOW_PYTHON_BIN=system("which python"))
Just ran across this, and the accepted answer by @Tomas was very helpful. I added a comment above, but for some additional context I wanted to write a more thorough response, in case any other Enterprise Databricks R users run across this post while trying to use the MLflow package for R on Databricks.
The Databricks MLflow quickstart guide will tell you that you need to run the following:
library(mlflow)
install_mlflow()
However, for Enterprise Databricks users, the install_mlflow() function will fail if your cluster doesn't have outside connectivity privileges (as most probably don't) and can't connect to the Anaconda repo to download the necessary packages. You'll likely get an error like this:
CondaHTTPError: HTTP 000 CONNECTION FAILED for url https://conda.anaconda.org/conda-forge/linux-64/current_repodata.js
The good news is that MLflow should already be installed on your Databricks runtime. So you can reference that install instead, and then as @Tomas mentioned, use it to set your R environment variables for MLFLOW_BIN and MLFLOW_PYTHON_BIN. From there, the R MLflow API works as specified (in my experience, but ymmv).
The only catch from the above solution is that when you use the system() function in R, you need to set intern=TRUE in order to capture the output of the command. The default behavior of the system() function is intern=FALSE. Thus, if you do not explicitly set intern=TRUE, the exit code 0 will be returned from your system() call (or perhaps another exit code upon an error) and Sys.setenv() will set the environment variable to 0!
### intern=TRUE missing ###
Sys.setenv(MLFLOW_BIN=system("which mlflow"))
Sys.setenv(MLFLOW_PYTHON_BIN=system("which python"))
Example output (you can see the environment variables did not get set correctly):
s <- Sys.getenv()
s[grep("MLFLOW", names(s))]
MLFLOW_BIN                 0
MLFLOW_CONDA_HOME          /databricks/conda
MLFLOW_PYTHON_BIN          0
MLFLOW_PYTHON_EXECUTABLE   /databricks/python/bin/python
MLFLOW_TRACKING_URI        databricks
However, when intern=TRUE, you'll get the correct environment variables:
### intern=TRUE set ###
Sys.setenv(MLFLOW_BIN=system("which mlflow", intern=TRUE))
Sys.setenv(MLFLOW_PYTHON_BIN=system("which python", intern=TRUE))
Example output:
s <- Sys.getenv()
s[grep("MLFLOW", names(s))]
MLFLOW_BIN                 /databricks/python3/bin/mlflow
MLFLOW_CONDA_HOME          /databricks/conda
MLFLOW_PYTHON_BIN          /databricks/python3/bin/python
MLFLOW_PYTHON_EXECUTABLE   /databricks/python/bin/python
MLFLOW_TRACKING_URI        databricks
Note: This was using Databricks runtime 9.1 LTS ML. This may or may not work on other Databricks runtime configurations.

Azure Machine Learning integration of R: Should the 'azureml' module have an attribute 'core'?

I'm having issues with Azure Machine Learning SDK for R: "module 'azureml' has no attribute 'core'"...
For reasons that aren't my own, I have to use azureml to apply machine learning (my own stuff, written in R) to data from our data warehouse that is put in the blob storage. The modelled output should be put back into the blob storage so it can be accessed from the data warehouse.
I've written the code in R on my local machine (stored in a git repo). Preferably, I'd find some method to pull my code from git into a pipeline in the azureml environment so that it can be directly run whenever new data is available in the blob storage.
I've embarked on a tutorial-spree and found this seemingly relevant walkthrough: Train and deploy your first model with Azure ML (and this one).
But... after trying all I could think of, I'm stuck on the first steps. After installing all (or at least, that's what I think) the packages, modules, apps, etc., and running the following code in RStudio:
library(azuremlsdk)
existing_ws <- get_workspace(name = name,
                             subscription_id = subscription_id,
                             resource_group = resource_group)
I run into an error that I haven't been able to fix:
AttributeError: module 'azureml' has no attribute 'core'
It seems that azureml is supposed to have an attribute "core", but when looking at it more closely, there is indeed no such attribute.
The function "get_workspace()" is trying to access: "azureml$core$Workspace$get".
I found that "azuerML$Workspace" does exist, but then I can't figure out how to make that work.
Can anyone explain to me why I'm encountering this error?
Does anyone know of a better tutorial on how to connect my R code to azureml's cloud service?
Any pointers in the right direction are much appreciated!
EDITS - still not solved:
After advice from others, I double, triple and quadruple checked the installation.
I updated R and I'm now running:
R.version
platform       x86_64-w64-mingw32
arch           x86_64
os             mingw32
system         x86_64, mingw32
status
major          3
minor          6.2
year           2019
month          12
day            12
svn rev        77560
language       R
version.string R version 3.6.2 (2019-12-12)
nickname       Dark and Stormy Night
I installed Conda with Python 3.6.10.
I installed the azuremlsdk R package (I tried both provided options).
I then realized that there are some inconsistencies with the versions of the azure modules, so I also tried installing it with the '--no-multiarch' install option:
remotes::install_cran('azuremlsdk', repos = 'http://cran.us.r-project.org', INSTALL_opts=c("--no-multiarch"))
Then, I installed the azureml python sdk.
I had a look at all the versions again (using python -m pip freeze):
azure-common==1.1.24
azure-graphrbac==0.61.1
azure-mgmt-authorization==0.60.0
azure-mgmt-containerregistry==2.8.0
azure-mgmt-keyvault==2.0.0
azure-mgmt-resource==7.0.0
azure-mgmt-storage==7.1.0
azureml==0.2.7
azureml-automl-core==1.0.83.1
azureml-core==1.0.69
azureml-dataprep==1.1.36
azureml-dataprep-native==13.2.0
azureml-pipeline==1.0.69
azureml-pipeline-core==1.0.69
azureml-pipeline-steps==1.0.69
azureml-sdk==1.0.69
azureml-telemetry==1.0.69
azureml-train==1.0.69
azureml-train-automl-client==1.0.83
azureml-train-core==1.0.69
azureml-train-restclients-hyperdrive==1.0.69
As I was surprised to see all the 1.0.69 versions instead of the 1.0.83 versions, I re-installed the azureml python sdk using:
azuremlsdk::install_azureml(version = "1.0.83")
This worked, in the sense that indeed all versions are now 1.0.83:
azure-common==1.1.24
azure-graphrbac==0.61.1
azure-mgmt-authorization==0.60.0
azure-mgmt-containerregistry==2.8.0
azure-mgmt-keyvault==2.0.0
azure-mgmt-resource==7.0.0
azure-mgmt-storage==7.1.0
azureml==0.2.7
azureml-automl-core==1.0.83.1
azureml-core==1.0.83
azureml-dataprep==1.1.36
azureml-dataprep-native==13.2.0
azureml-pipeline==1.0.83
azureml-pipeline-core==1.0.83
azureml-pipeline-steps==1.0.83
azureml-sdk==1.0.83
azureml-telemetry==1.0.83
azureml-train==1.0.83
azureml-train-automl-client==1.0.83
azureml-train-core==1.0.83
azureml-train-restclients-hyperdrive==1.0.83
But still... I get the error with the missing core. I get it both when running:
library(azuremlsdk)
get_current_run()
and also when running:
library(azuremlsdk)
existing_ws <- get_workspace(name = name,
                             subscription_id = subscription_id,
                             resource_group = resource_group)
Note that the first time running this code after starting up RStudio, I get the error:
Error in py_get_attr_impl(x, name, silent) :
AttributeError: module 'azureml' has no attribute '_base_sdk_common'
And every time after that I get this error:
Error in py_get_attr_impl(x, name, silent) :
AttributeError: module 'azureml' has no attribute 'core'
Any help would be much appreciated!
This issue was introduced by the latest reticulate 1.14 release, in which reticulate would create a default r-reticulate conda environment. Since Azure ML was installing the python SDK in an environment named r-azureml, the r-reticulate environment used by reticulate was missing the python SDK. A fix for this issue was addressed in a PR and has been merged into master. Please install from GitHub for now if you have reticulate version 1.14 and are running into this issue. We will be releasing an update to CRAN shortly.
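For reference, installing the development version would look something like this (a sketch, assuming the remotes package and that the repository is Azure/azureml-sdk-for-r):
# Install the patched development version of the R SDK from GitHub.
remotes::install_github("Azure/azureml-sdk-for-r")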
I seem to have fixed the issue by specifically installing the python package azureml AND azureml.core:
python -m pip install azureml
and then...
python -m pip install azureml.core
I did this for the Conda version that was called by R (r-reticulate). It's a bit odd not to be able to use the Conda environment 'r-azureml' without R switching back to 'r-reticulate', but ah well... at least I don't get the module 'azureml' has no attribute 'core' error anymore.
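If you hit the same environment mismatch, it may also help to bind reticulate to the conda environment that actually holds the python SDK before any Python is initialized (a sketch; r-azureml is the environment name the SDK installer uses, per the answer above):
library(reticulate)

# Must run before library(azuremlsdk) or anything else that initializes Python.
use_condaenv("r-azureml", required = TRUE)
py_config()   # confirm which python/conda env reticulate bound to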

Lexical error when connecting R with Jupyter Notebook by IRkernel::installspec()

I want to connect R with Jupyter Notebook, and I have installed IRkernel. However, when I run the code IRkernel::installspec(), an error occurs:
Error: lexical error: invalid char in json text.
C:\Users\xin_chen\AppData\Local
(right here) ------^
I had the same problem on one of my machines. I ended up manually copying the kernelspec files from another PC. The path is:
C:\Users\_username_\AppData\Roaming\jupyter\kernels\ir
If you don't have a working installation, you can copy the files from
https://github.com/IRkernel/IRkernel/tree/master/inst/kernelspec.
You might need to edit the kernel.json file to set the path of your R executable. For example:
{
  "argv": ["C:/Program Files/Microsoft/R Client/R_SERVER/bin/x64/R", "--slave", "-e", "IRkernel::main()", "--args", "{connection_file}"],
  "display_name": "R",
  "language": "R"
}
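Since the error above is a JSON lexical error, it can also help to sanity-check an edited kernel.json from R before Jupyter reads it (a sketch, assuming the jsonlite package; substitute your own username in the path):
# Throws a similar "lexical error" if the file is not valid JSON.
jsonlite::fromJSON("C:/Users/your_user/AppData/Roaming/jupyter/kernels/ir/kernel.json")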

Sublime Text 2 and R

I am trying to use Sublime Text 2 as an interface to the statistics software R [update/edit: Solved!].
On Windows, I have tried the following:
Installed R Tools. Turned out to be for Macintosh 64 only.
Tried to program a custom build file. Failed: no output returned.
{
  "cmd": "C:/Program Files/R/R-2.9.2/bin/R.exe --no-save $File"
}
Installed SublimeREPL. Failed: R menu option disabled...
[update/edit] Tried this (see wuub's reply):
{
  "default_extend_env": {"PATH": "{PATH};C:\\Program Files\\R\\R-2.9.2\\bin"}
}
Open SublimeREPL's user settings like this: Preferences -> Package Settings -> SublimeREPL -> Settings - User
There, set default_extend_env so that PATH points to your R installation:
{
  "default_extend_env": {"PATH": "{PATH};C:\\Program Files\\R\\R-2.14.2\\bin\\i386"}
}
Running Tools -> SublimeREPL -> R should launch the REPL as expected.
For Windows systems you can add a new build system for R (Tools -> Build System -> New Build System) and put in the following lines. Modify the path according to your R installation directory.
{"cmd": ["Rscript.exe", "$file"],
"path": "C:\\Program Files\\R\\R-2.15.2\\bin\\x64\\",
"selector": "source.r"}
You can execute the entire file by pressing Ctrl+B instead of starting SublimeREPL.
The build system's cmd value needs to be an array, so you want:
{
  "cmd": ["C:/Program Files/R/R-2.9.2/bin/R.exe", "--no-save", "$File"]
}
https://docs.sublimetext.io/guide/usage/build-systems.html#file-format
For Linux users, take a look at:
Sublime Text 2 R build System
It supports multi-selection and interactive R sessions.
