Thanks in advance for your input. I am a newbie to ML.
I've developed an R model (using RStudio on my local machine) and want to deploy it on a Hadoop cluster that has RStudio installed. I want to use SparkR to leverage high-performance computing. I just want to understand the role of SparkR here.
Will SparkR enable the R model to run the algorithm within Spark ML on the Hadoop Cluster?
OR
Will SparkR enable only the data processing, while the ML algorithm still runs within the context of R on the Hadoop cluster?
Appreciate your input.
These are general questions, but they actually have a very simple and straightforward answer: no (to both); SparkR will do neither.
From the Overview section of the SparkR docs:
SparkR is an R package that provides a light-weight frontend to use Apache Spark from R.
SparkR cannot even read native R models.
The idea behind using SparkR for ML tasks is that you develop your model specifically in SparkR (and if you try, you'll also discover that it is much more limited in comparison to the plethora of models available in R through the various packages).
Even conveniences like, say, confusionMatrix from the caret package are not available, since they operate on R data frames and not on Spark ones (see this question & answer).
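To make the distinction concrete, here is a minimal sketch of what developing a model in SparkR itself looks like (assuming Spark 2.x with a SparkR session available on the cluster); the training runs inside Spark ML, not in plain R:

library(SparkR)
sparkR.session()

# createDataFrame() copies a local R data frame into Spark
# (note: it replaces the '.' in the iris column names with '_')
df <- createDataFrame(iris)

# spark.glm() is trained by Spark ML, not by R's glm()
model <- spark.glm(df, Sepal_Length ~ Sepal_Width, family = "gaussian")
summary(model)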
Related
A little background to understand my problem:
My company uses a private server to put our ML models into production using opencpu. The ML models, which were generated using caret, are usually wrapped in an R package that does data preprocessing before passing the data to the model. The R package, opencpu, and the package's dependencies are built with Docker into a Docker container, which is then deployed onto the server. I don't understand the deployment process, but that's not my job. My job is to come up with the ML models, create the R package, and make sure it works (in our test environment) before it goes to production.
Recently I developed a model using Keras/TensorFlow in R, and I want to test this model within our test environment (which mimics the production environment). This means that I need to include the Keras/TensorFlow model inside an R package, similar to the caret version.
I want to know how I can do this without having to install Keras as a dependency. The only point of using Keras is to load the model with the load_model_hdf5 function; the prediction itself is done by the base R predict function. Personally, I think it is overkill to install such a large package (Keras in R installs TensorFlow, conda, and a Python environment as well) just to load a model.
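For reference, the entire Keras footprint in such a package boils down to something like the sketch below ("model.h5" and x are placeholders for the bundled model file and a numeric input matrix):

library(keras)

# load_model_hdf5() is the only keras function needed at prediction time
model <- load_model_hdf5("model.h5")

# predict() then dispatches to the keras model
preds <- predict(model, x)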
This page (https://cran.r-project.org/web/packages/tfdeploy/vignettes/introduction.html) describes methods for deploying TensorFlow models in R, but it only discusses RStudio Connect, CloudML, and TensorFlow Serving (which uses gRPC).
I need to limit the number of threads running my neural network using these instructions here: https://github.com/keras-team/keras/issues/4740.
However, I am using keras in R, and I am not sure how to access the TensorFlow implementation used by the keras that I load in R with
library("keras")
I can call library(tensorflow); however, isn't that loading a copy of the library unrelated to the one loaded by keras? I cannot find any functionality in R that lets me load the TensorFlow backend associated with keras in RStudio, nor any links to anyone doing the same.
Can someone suggest a way to perform the operations from the link in R, given that keras is loaded with library("keras")? (In the link, the TensorFlow backend for keras is used to set the number of threads per core.) It would also be good to know how to check which TensorFlow version keras loads into R.
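For what it's worth, here is a minimal sketch of one way to do this from R, assuming the TensorFlow 1.x API that keras used at the time (ConfigProto plus k_set_session); the thread counts are illustrative:

library(keras)
library(tensorflow)

# Both packages talk to the same Python session through reticulate,
# so tf below is the same backend that keras uses
tensorflow::tf_version()  # which TensorFlow version is loaded
keras::keras_version()    # which Keras version is loaded

# TF 1.x style: build a session limited to one thread and hand it to keras
config <- tf$ConfigProto(intra_op_parallelism_threads = 1L,
                         inter_op_parallelism_threads = 1L)
k_set_session(tf$Session(config = config))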
I want to create a connection between R and QlikView using the 'opencpu' R package.
I've seen some examples, but I did not understand how to use the opencpu R package to create the connection between R and QlikView.
With its version 3.1 release, the Qlik engine is able to pass data in and out of both R and Python, including analysis context about a data set.
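I have not used the Qlik side, but on the R side the opencpu package can expose any installed R package over HTTP, which is what QlikView would then call. A minimal sketch (the port is whatever ocpu_start_server() reports):

library(opencpu)

# Start a local single-user OpenCPU server; it prints the base URL it
# serves, e.g. http://localhost:5656/ocpu
ocpu_start_server()

QlikView (or curl, for testing) can then invoke any R function over HTTP, e.g. a POST to http://localhost:5656/ocpu/library/stats/R/rnorm/json with body n=5 returns five random numbers as JSON.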
I am learning the h2o package now.
I installed the h2o package from CRAN and couldn't run this code:
## To import the small iris data file from H2O's package
irisPath = system.file("extdata", "iris.csv", package="h2o")
iris.hex = h2o.importFile(localH2O, path = irisPath, key = "iris.hex")
I am getting the error below:
Error in h2o.importFile(localH2O, path = irisPath, key = "iris.hex") :
unused argument (key = "iris.hex")
My second question is: are there good resources for learning h2o in R apart from this:
http://h2o-release.s3.amazonaws.com/h2o/rel-lambert/5/docs-website/Ruser/rtutorial.html
My third question: how does h2o work, in simple words?
The reason this code no longer works is that there was a breaking API change from H2O 2.0 to H2O 3.0 back in 2015. The docs you discovered (probably via a Google search) are for the very old H2O 2.0. The up-to-date docs can always be found at http://docs.h2o.ai/h2o/latest-stable/h2o-docs/index.html
Answering your error question:
H2O has changed a bit since that documentation was written. Reading the iris file now works as follows:
iris.hex = h2o.importFile(path = irisPath, destination_frame = "iris.hex")
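Putting it together, a complete snippet for current H2O 3.x looks like this (note that h2o.init() replaces the old localH2O handle, which is no longer passed to h2o.importFile):

library(h2o)
h2o.init()  # starts (or connects to) a local H2O instance

irisPath <- system.file("extdata", "iris.csv", package = "h2o")
iris.hex <- h2o.importFile(path = irisPath, destination_frame = "iris.hex")
summary(iris.hex)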
Your second and third questions are off-topic under SO rules, but below you will find a short list of resources:
- H2O training materials: go to the h2o.ai website and then to the general documentation. You can find all the material presented at H2O World 2015 there, and there is also a link to H2O University.
- Check their blog. There are some gold nuggets in there.
- Read the booklets they have available on GBM, GLM, and Deep Learning. They contain examples in R and Python.
- Kaggle. Search the scripts / kernels for h2o.
As for your third question, read their "Why H2O" pages.
To answer your question about how H2O works: it is a little hard to summarize here, but in a nutshell, H2O is an open-source, enterprise-ready machine intelligence engine accessible from popular machine learning languages, i.e. R and Python, as well as the programming languages Java and Scala. Enterprise-ready means users can distribute execution across multiple machines when the data is extremely large. The Java-based core engine has built-in implementations of multiple algorithms, and every language interface goes through an interpreter to the H2O core engine, which can be a distributed cluster, to build models and score results. There is a lot in between, so I suggest visiting the link below to learn more about H2O's architecture and execution from the various supported languages:
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/architecture.html
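As a concrete illustration of that client/engine split: the R session is only a thin REST client, so pointing it at a remote cluster is just a different h2o.init() call (the host name below is a placeholder):

library(h2o)

# Connect to an existing (possibly multi-node) H2O cluster instead of
# starting a local JVM; 54321 is H2O's default port
h2o.init(ip = "h2o-cluster.example.com", port = 54321, startH2O = FALSE)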
You can dig deeper into H2O in R, from installation through to applying the h2o machine learning library. Go through this link.
This link also helps if you want to run h2o machine learning on top of the SparkR framework.
If you want to get an idea of a working h2o prototype starting from the very basics, then follow this link. It provides a basic working prototype with a code walk-through (a quick learning tutorial).
Apart from the points above, it also covers the following:
- How to convert an H2O data frame to an R or Spark data frame and vice versa (see the sketch after this list)
- The pros and cons of Spark MLlib versus the H2O machine learning library
- The strengths of h2o compared to other ML libraries
- How to apply an ML algorithm to R and Spark data frames, etc.
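For example, the first point (converting between R and H2O frames) comes down to as.h2o() and as.data.frame() in the h2o package:

library(h2o)
h2o.init()

# R data.frame -> H2O frame (the data is pushed to the H2O cluster)
iris.hex <- as.h2o(iris, destination_frame = "iris.hex")

# H2O frame -> R data.frame (the data is pulled back into the R session)
iris.df <- as.data.frame(iris.hex)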
I was using the H2O R package (2014 version) to perform a deep learning task on textual data. I did my research in early 2015 and obtained promising results with the deep learning method (function h2o.deeplearning; e.g., F-score and recall were always > 0.9). My original R code no longer works (due to the change in the H2O package in Nov 2015), so I revised it. However, when I try to run the same deep learning model (same settings), I can no longer achieve those strong results! Please, I wish to know whether H2O has changed any internal modeling settings since the package was revised. I wish to reproduce my old results with the new package; please kindly help.
H2O Deep Learning (2.0 and 3.0) is not reproducible by default -- you can change this by setting reproducible = TRUE, however that will slow things down quite a bit, as reproducibility requires the code to be run on a single core. Therefore the variability could be due to the randomness in the algorithm alone, rather than from the upgrade of H2O 2.0 to 3.0.
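Concretely, that means rerunning your model along these lines (x, y, and train.hex are placeholders for your own predictors, response, and training frame):

model <- h2o.deeplearning(x = x, y = y,
                          training_frame = train.hex,
                          reproducible = TRUE,  # single-core, deterministic
                          seed = 1234)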
If you want to use H2O Classic (2.0), then your old code will still work, as is. You might try running that first to see if you can track down the source of the variability. There is nothing wrong with using H2O Classic to finish a project that you started a while ago.
Implementation details for H2O 3.0 Deep Learning are available in the Deep Learning booklet.
There is more information on what has changed in H2O DL between H2O 2.0 and 3.0 here.