Security concerns with the H2O R package

I am using the H2O R package.
My understanding is that this package requires you to have an internet connection and to connect to the h2o servers. Is that correct? If you use the h2o package to run machine learning models on your data, does h2o "see" your data? I turned off my wifi and tried running some machine learning models using h2o:
data(iris)
library(h2o)
h2o.init()
iris_hf <- as.h2o(iris)
iris_dl <- h2o.deeplearning(x = 1:4, y = 5, training_frame = iris_hf, seed=123456)
predictions <- h2o.predict(iris_dl, iris_hf)
This seems to work, but could someone please confirm? If you do not want anyone to see your data, is it still a good idea to use the "h2o" library? Since the code above runs without an internet connection, I am not sure about this.

From the documentation of h2o.init() (emphasis mine):
This method first checks if H2O is connectible. If it cannot connect and startH2O = TRUE with IP of localhost, it will attempt to start an instance of H2O with IP = localhost, port = 54321. Otherwise, it stops immediately with an error. When initializing H2O locally, this method searches for h2o.jar in the R library resources [...], and if the file does not exist, it will automatically attempt to download the correct version from Amazon S3. The user must have Internet access for this process to be successful. Once connected, the method checks to see if the local H2O R package version matches the version of H2O running on the server. If there is a mismatch and the user indicates she wishes to upgrade, it will remove the local H2O R package and download/install the H2O R package from the server.
So, h2o.init() with the default setting ip = "127.0.0.1", as here, connects the R session to the H2O instance (sometimes referred to as the "server") running on your local machine. If all the necessary package files are in place and up to date, no internet connection is necessary; the package attempts to connect to the internet only to download files that are missing or out of date. No data is uploaded anywhere.
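One way to see this for yourself is to check that something is listening on the local address while your network is off. The sketch below is in Python rather than R and uses only the standard library. The helper name is_listening is made up for illustration, and the host and port are simply the defaults quoted in the documentation above.

```python
import socket


def is_listening(host="127.0.0.1", port=54321, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection raises OSError if nothing is listening
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If you run h2o.init() in R with the defaults and then call is_listening() with Wi-Fi switched off, a True result is consistent with the H2O instance running entirely on your own machine.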

Related

how to load an RDBMS driver for h2o in a Jupyter notebook?

I'd like to create a self-contained Jupyter notebook that uses h2o to import and model data that resides in a relational database. The docs show an example where h2o is launched with the JDBC driver in the classpath, e.g.
java -cp <path_to_h2o_jar>:<path_to_jdbc_driver_jar> water.H2OApp
I'd prefer to start h2o from a notebook that's a standalone, reproducible artifact rather than have special steps to prep the environment prior to running the notebook. If I run the following snippet:
import h2o
h2o.init()
connection_url = "jdbc:mysql://mysql.woolford.io/mydb"
select_query = "SELECT description, price FROM mytable"
username = "myuser"
password = "b#dp#ss"
mytable_data = h2o.import_sql_select(connection_url, select_query, username, password)
... the import_sql_select method fails because the driver isn't loaded:
Server error java.lang.RuntimeException:
Error: SQLException: No suitable driver found for jdbc:mysql://mysql.woolford.io/mydb
Is there a way to load the driver when the h2o.init() call is made? Or a best practice for this?
h2o.init() takes a parameter called extra_classpath. You can use this parameter to provide the path to the JDBC driver and H2O will launch with the driver.
This option is designed exactly for the purpose of not having to start H2O outside of the notebook interface.
Example:
import h2o
h2o.init(extra_classpath=["/Users/michal/Downloads/apache-hive-2.2.0-bin/jdbc/hive-jdbc-2.2.0-standalone.jar"])
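For the MySQL case in the question, the same idea can be wrapped in a small helper so the notebook stays self-contained. This is only a sketch: find_jdbc_jars, init_h2o_with_drivers and the driver directory are hypothetical names, and the only H2O feature relied on is the extra_classpath parameter shown above.

```python
import glob
import os


def find_jdbc_jars(driver_dir):
    """Return the sorted absolute paths of all .jar files in driver_dir."""
    pattern = os.path.join(os.path.abspath(driver_dir), "*.jar")
    return sorted(glob.glob(pattern))


def init_h2o_with_drivers(driver_dir):
    """Call h2o.init() with every jar found in driver_dir on the extra classpath."""
    import h2o  # imported lazily so find_jdbc_jars works without h2o installed

    jars = find_jdbc_jars(driver_dir)
    if not jars:
        raise FileNotFoundError(f"no .jar files found in {driver_dir}")
    h2o.init(extra_classpath=jars)
```

Since a classpath can only be applied when the Java process is launched, it is probably safest to make sure no H2O instance is already running before calling init_h2o_with_drivers, and then to call h2o.import_sql_select as in the question.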

How to prevent h2o cluster shutdown without notice using R

I use the R version of h2o and I have a question about the local h2o cluster. I set up the cluster by executing this command in R:
h2o.init()
However, the cluster shuts down automatically when I do not use it for a few hours. For example, I run my model overnight, but when I come back to the office in the morning to check on it, it says:
Error in h2o.getConnection() : No active connection to an H2O cluster. Did you run h2o.init()?
Is there a way to fix or work around it?
If the H2O cluster is still running, then your models are all still there (assuming they finished training successfully). There are a number of ways that you can check if the H2O Java cluster is still running. In R, you can check the output of these functions:
h2o.clusterStatus()
h2o.clusterInfo()
At the command line (look for a Java process):
ps aux | grep java
If you started H2O from R, then you should see a line that looks something like this:
yourusername 26215 0.0 2.7 8353760 454128 ?? S 9:41PM 21:25.33 /usr/bin/java -ea -cp /Library/Frameworks/R.framework/Versions/3.3/Resources/library/h2o/java/h2o.jar water.H2OApp -name H2O_started_from_R_me_iqv833 -ip localhost -port 54321 -ice_root /var/folders/2j/jg4sl53d5q53tc2_nzm9fz5h0000gn/T//Rtmp6XG99X
H2O models do not live in the R environment, they live in the H2O cluster (a Java process). It sounds like what's happening is that the R object representing your model (which is actually just a pointer to the model in the H2O cluster) is having issues finding the model since your cluster disconnected. I don't know exactly what's going on because you haven't posted the errors you're receiving when you try to use h2o.predict() or h2o.performance().
To get the model back, you can use the h2o.getModel() function. You will need to know the ID of your model. If your model object (that's not working properly) is still accessible, then you can see the model ID easily that way: model@model_id. You can also head over to H2O Flow in the browser (by typing http://127.0.0.1:54321 if you started H2O with the defaults) and view all the models by ID that way.
Once you know the model ID, then refresh the model by doing:
model <- h2o.getModel("model_id")
This should re-establish the connection to your model and the h2o.predict() and h2o.performance() functions should work again.

how to read data from Cassandra (DBeaver) to R

I am using the Cassandra CQL system in the DBeaver database tool. I want to connect to this Cassandra instance from R to read data. Unfortunately, with the RCassandra package the connection never completes: I waited for more than 2 hours and it was still loading. Does anyone have any idea what is wrong?
The code is as follows:
library(RCassandra)
rc <- RC.connect(host ="********", port = 9042)
RC.login(rc, username = "*****", password = "******")
After the RC.login step, it is still loading after more than 2 hours.
I have also tried using RJDBC package like posted here : How to read data from Cassandra with R?.
library(RJDBC)
drv <- JDBC("org.apache.cassandra.cql.jdbc.CassandraDriver",
list.files("C:/Program Files/DBeaver/jre/lib",
pattern="jar$",full.names=T))
But this throws an error:
Error in .jfindClass(as.character(driverClass)[1]) : class not found
None of the answers from the above link work for me. I am using the latest R version, 3.4.0 (2017-04-21), and a new version of DBeaver, 4.0.4.
For your first approach, which I am less familiar with, should you not have a line that sets the use of the connection?
such as:
library(RCassandra)
c <- RC.connect(host ="52.0.15.195", port = 9042)
RC.login(c, username = "*****", password = "******")
RC.use(c, "some_db")
Did you check logs that you are not getting some silent error while connecting?
For your second approach, your R program is not finding the driver on the Java (JVM) classpath.
See this entry for help on how to fix it.

Error while using h2o.init in R

This is the error message:
> h2o.init()
Error in dirname(path) : path too long
In addition: There were 12 warnings (use warnings() to see them)
This is one of the warning messages (the others are similar):
> warnings()
Warning messages:
1: In normalizePath(path.expand(path), winslash, mustWork) :
path[1]="\\FILE-EM1-06/USERDATA2$/john134/My Documents/./../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..": The filename or extension is too long
Any idea how to work around this error?
Thanks
It seems that a Windows path string is limited to about 260 characters (MAX_PATH). Usually, setting a shorter working directory with setwd(shorterExistingWorkDir) works and should address your issue.
I struggled with this issue quite a bit, including upgrading.
Most folks are assuming that you've literally just set an incredibly long path. I don't think this is the case (it wasn't for me, at least). It's that the PATH may be set on a network drive or other device where the underlying mapped paths are more complicated.
A related thread is here on the H2O forum:
Main issue is the user had a Windows drive that did not conform to the norm, i.e., "C://", etc. Instead, the user had a network drive
(DTCHYB-AZPX015/). This caused issues in the search for a config
file as there was no "root" (In this case, "root" is reaching your Win
drive). Since there was no "root", the path to search kept expanding
until it caused R to error out with the above exception.
The fix is to NOT search for a config when h2o.init() is called. Rather, only search for a config if a user asks to do so. My proposal
is to add a new field to h2o.init() called ignore_config. This
field will be set to TRUE by default.
When you call h2o.init(), the R environment launches the H2O application (actually a web server) in the background; it was installed when you installed the H2O package into R. The local runtime environment uses the full path of the location where the H2O jar file is located. Because the package is installed deep inside nested folders in your file system, the path exceeds the OS path length limit (about 260 characters on Windows), the backend H2O server fails to launch, and you see this error. In your case you are using an external path, which adds more characters to the path and makes the problem worse.
For example, on my OS X machine h2o.jar is located as follows:
/Library/Frameworks/R.framework/Resources/library/h2o <-- H2O package Path
/Library/Frameworks/R.framework/Resources/library/h2o/java/h2o.jar <-- Jar Path
As you are using Windows, you need to find a way to shorten this path so that it fits within the OS limit, and then it will work.
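To find out whether the jar path is really close to the limit, you can measure it. The sketch below is in Python and uses only the standard library; path_too_long is a hypothetical helper, and the 260-character figure is the classic Windows MAX_PATH value, which may not apply if long-path support is enabled on your system.

```python
import os

WINDOWS_MAX_PATH = 260  # classic Windows MAX_PATH limit, in characters


def path_too_long(path, limit=WINDOWS_MAX_PATH):
    """Return True if the absolute form of path reaches or exceeds limit."""
    # abspath also collapses "." and ".." components before measuring
    return len(os.path.abspath(path)) >= limit
```

Pass it the full location of h2o.jar inside your R library. If it returns False, the length of the install path alone is probably not the cause, and the network-drive explanation above becomes more likely.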
The other solution is to run h2o.jar separately and then just use R to connect to the H2O cluster. The steps are as follows:
Download H2O 3.10.4.2 and unzip it to a folder close to the root so you do not hit the path length limit again. Also install the 3.10.4.2 R package. (Try to keep the same version.)
Run H2O > java -jar h2o.jar
From RStudio console try > h2o.init()
So if an H2O cluster is already running, h2o.init() will connect to it instead of starting a new one, and you will bypass the above problem.
If you hit any problem write here and we will help you.

Error when calling h2o.init() in R to connect to a remote cluster

I'm trying to connect from R to a cluster hosted on an EC2 machine, and I get the same error on both Windows and Mac:
> h2o.init(ip = "<Public IP>")
Connection successful!
ERROR: Unexpected HTTP Status code: 404 Not Found (url = http://<Public IP>:54321/3/Cloud?skip_ticks=true)
Error: lexical error: invalid char in json text.
<!DOCTYPE html> <html lang="en"
(right here) ------^
Cluster is reachable at http://<Public IP>:54321/
Starting a local cluster with h2o.init() also works fine in R, so the problem is only when trying to connect to remote one.
I've seen the following issue marked as resolved, but it doesn't help in my case. Has anybody experienced anything similar?
UPD: The answer was very simple. It turns out that the code example given in their guide for EC2 is outdated and uses the old version of H2O. Using the most recent version (3.9.1.5555 at the moment) on EC2 machines has resolved the issue.
To elaborate on the OP's update, when using a remote cluster:
Make sure you install the most recent version (check the S3 download page for the redirect to the release number). In the example below, this is 3.13.0.3908:
wget http://s3.amazonaws.com/h2o-release/h2o/master/3908/h2o-3.13.0.3908.zip
unzip h2o-3.13.0.3908.zip
mv h2o-3.13.0.3908 h2o
cd h2o
java -Xmx4g -jar h2o.jar
You then need to install the version of h2o-R that corresponds to this version. (The correct version is likely not the CRAN version.) Otherwise you will get an error like:
Error in h2o.init(ip = "XXX.XX.XX.XXX", startH2O = FALSE) :
Version mismatch! H2O is running version 3.13.0.3908 but h2o-R package is version 3.10.4.6.
Install the matching h2o-R version from - http://h2o-release.s3.amazonaws.com/h2o/master/3908/index.html
So you need to note the version number H2O is running (in the above example, 3908), make sure you have previously removed any existing h2o-R package (see here for more info), and then do:
install.packages("h2o", type="source", repos="http://h2o-release.s3.amazonaws.com/h2o/master/3908/R")
Now it should work:
library('h2o')
remoteH2O <- h2o.init(ip='XXX.XX.XX.XXX', startH2O=FALSE) # Connection successful!
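The link between the version string in the mismatch error and the repository URL can be made explicit with a small helper. This is a sketch in Python using only string handling; build_number and r_repo_url are hypothetical names, and the URL pattern is taken solely from the master-branch example above, so the branch argument is an assumption for other release lines.

```python
def build_number(h2o_version):
    """Return the last component of a version such as '3.13.0.3908'."""
    parts = h2o_version.strip().split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"unexpected H2O version string: {h2o_version!r}")
    return parts[-1]


def r_repo_url(h2o_version, branch="master"):
    """Build the R package repository URL for the given H2O version."""
    return (
        "http://h2o-release.s3.amazonaws.com/h2o/"
        f"{branch}/{build_number(h2o_version)}/R"
    )
```

With the version reported in the error message, r_repo_url("3.13.0.3908") gives the same repos value that is passed to install.packages() above.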
