R targets with H2O

I use targets as a pipelining tool for an ML project with H2O.
The main peculiarity of using H2O here is that it creates a new "cluster" (essentially a new local process/server that communicates via REST APIs, as far as I understand).
The issue I am having is two-fold:
1. How can I stop/operate the cluster within the targets framework in a smart way?
2. How can I save and load the data/models within the targets framework?
MWE
A minimal working example I came up with looks like this (this being the _targets.R file):
library(targets)
library(h2o)
# start h2o cluster once _targets.R gets evaluated
h2o.init(nthreads = 2, max_mem_size = "2G", port = 54322, name = "TESTCLUSTER")
create_dataset_h2o <- function() {
  # connect to the h2o cluster
  h2o.init(ip = "localhost", port = 54322, name = "TESTCLUSTER", startH2O = FALSE)
  # convert the data to an h2o dataframe
  as.h2o(iris)
}

train_model <- function(hex_data) {
  # connect to the h2o cluster
  h2o.init(ip = "localhost", port = 54322, name = "TESTCLUSTER", startH2O = FALSE)
  h2o.randomForest(x = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
                   y = c("Species"),
                   training_frame = hex_data,
                   model_id = "our.rf",
                   seed = 1234)
}

predict_model <- function(model, hex_data) {
  # connect to the h2o cluster
  h2o.init(ip = "localhost", port = 54322, name = "TESTCLUSTER", startH2O = FALSE)
  h2o.predict(model, newdata = hex_data)
}

list(
  tar_target(data, create_dataset_h2o()),
  tar_target(model, train_model(data), format = "qs"),
  tar_target(predict, predict_model(model, data), format = "qs")
)
This kinda works, but runs into the two issues I outlined above and detail below...
Ad 1 - stopping the cluster
Usually I would put a h2o::h2o.shutdown(prompt = FALSE) at the end of my script, but this does not work in this case.
Alternatively, I came up with a new target that is always run.
# in _targets.R in the final list
tar_target(END, h2o.shutdown(prompt = FALSE), cue = tar_cue(mode = "always"))
This works when I run tar_make() but not when I use tar_visnetwork().
Another option is to use:
# after the h2o.init(...) call inside _targets.R
on.exit(h2o.shutdown(prompt = FALSE), add = TRUE)
Another alternative that I came up with is to handle the server outside of targets and only connect to it. But I feel that this might break the targets workflow...
Do you have any other idea how to handle this?
Ad 2 - saving the dataset and model
The code in the MWE does not save the data for the targets model and predict in the correct format (format = "qs"). Sometimes (I think when the cluster gets restarted), the data gets "invalidated" and h2o throws an error. The h2o-format data in the R session is only a pointer to the h2o dataframe (see also the docs).
For keras, which similarly stores the models outside of R, there is the option format = "keras", which calls keras::save_model_hdf5() behind the scenes. Similarly, H2O would require h2o::h2o.exportFile() and h2o::h2o.importFile() for the dataset and h2o::h2o.saveModel() and h2o::h2o.loadModel() for models (see also docs).
Is there a way to create additional formats for tar_target(), or do I need to write the data to a file and return the file path? The downside to the latter is that the file lives outside of the _targets folder system, if I am not mistaken.

Ad 1
I would recommend handling the H2O cluster outside the pipeline in a separate script. That way, tar_visnetwork() would not start or stop the cluster, and you could more cleanly separate the software engineering from the data analysis.
# run_pipeline.R
start_h2o_cluster(port = ...)
on.exit(stop_h2o_cluster(port = ...))
targets::tar_make_clustermq(workers = 4)
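start_h2o_cluster() and stop_h2o_cluster() are not h2o functions; they stand in for user-defined wrappers. A minimal sketch of what they might look like, assuming they just delegate to h2o.init() and h2o.shutdown() with the MWE's settings:

# Hypothetical wrappers around h2o's own start/stop functions.
start_h2o_cluster <- function(port) {
  h2o::h2o.init(nthreads = 2, max_mem_size = "2G", port = port,
                name = "TESTCLUSTER")
}

stop_h2o_cluster <- function(port) {
  # Connect to the running cluster, then shut it down without prompting.
  h2o::h2o.connect(ip = "localhost", port = port)
  h2o::h2o.shutdown(prompt = FALSE)
}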
Ad 2
It sounds like H2O objects are not exportable. Currently, you would need to save those files manually, identify the paths, and write format = "file" in tar_target(). I am willing to consider H2O-based formats. Are all objects in some way covered by h2o.exportFile(), h2o.importFile(), h2o::h2o.saveModel(), and h2o::h2o.loadModel(), or are there more kinds of objects with different serialization functions? And does h2o have utilities to perform this (un)serialization in memory, like serialize_model()/unserialize_model() in keras?
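For example, a minimal sketch of the format = "file" workaround for the model target (the h2o_models directory is a hypothetical choice; h2o.saveModel() returns the path it wrote, which is what format = "file" expects):

# Hypothetical target function: save the model to disk and return its path.
train_model_file <- function(hex_data) {
  h2o.init(ip = "localhost", port = 54322, name = "TESTCLUSTER", startH2O = FALSE)
  model <- h2o.randomForest(x = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
                            y = c("Species"),
                            training_frame = hex_data,
                            model_id = "our.rf",
                            seed = 1234)
  # h2o.saveModel() returns the full path of the saved model file.
  h2o.saveModel(model, path = "h2o_models", force = TRUE)
}
# In the target list:
#   tar_target(model_file, train_model_file(data), format = "file")
# Downstream targets would reload it with h2o.loadModel(model_file).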

Related

Temporary objects in parallel computing in R

I have a pretty long R script which needs to be iterated several hundred times. I am using a 32-core, 32 GB RAM cloud service to do the job. To make the code run faster, I want to parallelize it with the foreach() command. I have the code working with no errors. However, I need to make sure I am getting proper results. To illustrate my point, I have set up a simplified mock example:
foreach (i = 1:100) %dopar% {
  age <- seq(from = 20, to = 79, by = 1)
  d <- as.data.frame(age)
  d$gender <- rbinom(nrow(d), size = 1, prob = 0.5)
  d$prob <- cut(d$age, breaks = c(20, 30, 40, 50, 60, 70, 80), include.lowest = TRUE,
                right = FALSE, labels = c(.001, .01, .1, .25, .3, .1))
  d$prob <- as.numeric(as.character(d$prob))
  d$event <- rbinom(nrow(d), size = 1, prob = d$prob)
  save(d, file = paste("d_", i, ".rda", sep = ""))
  table(d$gender, d$event)
}
I am wondering whether temporary objects, like "d" in this example, are independent for each worker when running this code. If there is only one object "d" in memory, shared by the different workers, what is the solution for getting independent objects?
For reference, I am using the code proposed by this page (https://github.com/tobigithub/R-parallel) to make clusters.
Thanks in advance for your reply.

Error in R h2o.predict with xgboost -> java.lang.NullPointerException

First of all thanks for implementing XGBoost in h2o!
Unfortunately, I am unable to predict from an h2o xgboost model that's loaded from disk (which, I'm sure you can appreciate, is really frustrating).
I am using the latest stable release of h2o, i.e. 3.10.5.2, and I am using the R client.
I have included an example below that should enable you to reproduce the issue.
Thanks in advance
### Start h2o
require(h2o)
local_h2o = h2o.init()

### Source the base data set
data(mtcars)
h2o_mtcars = as.h2o(x = mtcars, destination_frame = 'h2o_mtcars')

### Fit a model to be saved
mdl_to_save = h2o.xgboost(model_id = 'mdl_to_save', y = 1, x = 2:11, training_frame = h2o_mtcars) ## This class doesn't work
#mdl_to_save = h2o.glm(model_id = 'mdl_to_save', y = 1, x = 2:11, training_frame = h2o_mtcars) ## This class works

### Take some reference predictions
ref_preds = h2o.predict(object = mdl_to_save, newdata = h2o_mtcars)

### Save the model to disk
silent = h2o.saveModel(object = mdl_to_save, path = 'INSERT_PATH', force = TRUE)

### Delete the model to make sure there can't be any strange locking issues
h2o.rm(ids = 'mdl_to_save')

### Load it back up
loaded_mdl = h2o.loadModel(path = 'INSERT_PATH/mdl_to_save')

### Score the model
### The h2o.predict statement below is what causes the error: java.lang.NullPointerException
lod_preds = h2o.predict(object = loaded_mdl, newdata = h2o_mtcars)
all.equal(ref_preds, lod_preds)
At the time I write this (January 2018), this is still a bug for xgboost. See this ticket for more information.
In the meantime, you can download the model as a POJO or MOJO file:
h2o.download_pojo(model, path = "/media/somewhere/tmp")
Loading the model back isn't that easy, unfortunately, but you can pass the new data to the saved POJO model with the function:
h2o.predict_json()
Note that the new data must be provided in JSON format.
See this question for more details
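A hedged usage sketch, assuming h2o.predict_json() takes the saved POJO's location and a JSON string of feature values (check ?h2o.predict_json for the exact signature):

# Download the trained model as a POJO instead of using h2o.saveModel().
h2o.download_pojo(mdl_to_save, path = "INSERT_PATH")
# Score a new row by passing it as JSON; the column names below come
# from the mtcars example above.
lod_preds <- h2o.predict_json(model = "INSERT_PATH",
                              json  = '{"cyl": 6, "disp": 160, "hp": 110}')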

H2O: Deep learning object not found in function 'predict' for argument 'model'

I'm just testing out h2o, in particular its deep learning capabilities, since I've heard great things about it. So far I've been using the following code:
library(h2o)
library(caret)
data("iris")
# Initiate H2O --------------------
h2o.removeAll() # Clean up. Just in case H2O was already running
h2o.init(nthreads = -1, max_mem_size="22G") # Start an H2O cluster with all threads available
# Get training and tournament data -------------------
a <- createDataPartition(iris$Species, list=FALSE)
training <- iris[a,]
test <- iris[-a,]
# Convert target to factor -------------------
target <- as.factor(iris$Species)
feature_names <- names(train)[1:(ncol(train)-1)]
train_h2o <- as.h2o(train)
test_h2o <- as.h2o(test)
prob <- test[, "id", drop = FALSE]
model_dl <- h2o.deeplearning(x = feature_names, y = "target", training_frame = train_h2o, stopping_metric = "logloss")
h2o.logloss(model_dl)
pred_dl <- predict(model_dl, newdata = tourn_h2o)
prob <- cbind(prob, as.data.frame(pred_dl$p1, col.names = "dl"))
write.table(prob[, c("id", "dl")], paste0(model_dl@model_id, ".csv"), sep = ",", row.names = FALSE, col.names = c("id", "probability"))
The relevant part is really that last line, where I got the following error:
Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = page, :
ERROR MESSAGE:
Object 'DeepLearning_model_R_1494350691427_70' not found in function: predict for argument: model
Has anyone come across this before? Are there any easy solutions to this that I might be missing? Thanks in advance.
EDIT: With the updated code I get the error:
Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = page, :
ERROR MESSAGE:
Illegal argument(s) for DeepLearning model: DeepLearning_model_R_1494428751150_1. Details: ERRR on field: _train: Training data must have at least 2 features (incl. response).
ERRR on field: _stopping_metric: Stopping metric cannot be logloss for regression.
I assume this has to do with the way the Iris dataset is being read in.
Answer To First Question: Your original error message sounds like one you can get when things get out of sync. E.g. maybe you had two sessions running at once, and removed the model in one session; the other session wouldn't know its variables are now out of date. H2O allows multiple connections, but they have to be co-operative. (Flow - see next paragraph - counts as a second session.)
Unless you can make a reproducible example, shrug, put it down to gremlins, and start a new session. Or go and look at the data/models in Flow (a web server always running on 127.0.0.1:54321), and see if something is no longer there.
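From R, a quick way to make the same check is h2o.ls(), which lists the keys (frames and models) currently stored on the cluster:

library(h2o)
h2o.init()
# List everything the cluster currently holds; if the model id from the
# error message is missing here, the R-side handle is stale.
h2o.ls()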
For your EDIT question: your code is making a regression model, but you are trying to use logloss, so H2O thought you were doing classification. This is caused by not having set the target variable to be a factor. Your current as.factor() line is on the wrong data, in the wrong place. It should go after your as.h2o() lines:
train_h2o <- as.h2o(training) #Typo fix
test_h2o <- as.h2o(test)
feature_names <- names(training)[1:(ncol(training)-1)] #typo fix
y = "Species" #The column we want to predict
train_h2o[,y] <- as.factor(train_h2o[,y])
test_h2o[,y] <- as.factor(test_h2o[,y])
And then make the model with:
model_dl <- h2o.deeplearning(x = feature_names, y = y, training_frame = train_h2o, stopping_metric = "logloss")
Get predictions with:
pred_dl <- predict(model_dl, newdata = test_h2o) #Typo fix
And compare with correct answer with the prediction using:
cbind(test[, y], as.data.frame(pred_dl$predict))
(BTW, H2O always detects the Iris data set columns as numeric vs. factor perfectly, so the above as.factor() lines are not needed; your error message must've been on your original data.)
StackOverflow advice: test your reproducible example, in full, and copy and paste in that exact code, with the exact error message that code is giving you. Your code had numerous small typos. E.g. train in places, training in others. createDataPartition() was not given; I assumed a = sample(nrow(iris), 0.8*nrow(iris)). test has no "id" column.
Other H2O advice:
Run h2o.removeAll() after h2o.init(). It was giving you an error message if run before. (Personally I avoid that function - it is the kind of thing that gets left in a production script by mistake...)
Consider importing your data into h2o earlier, and using h2o.splitFrame() to split it. I.e. avoid doing things in R that H2O can easily handle.
Avoid having your data in R, at all, if you can. Prefer importFile() over as.h2o().
The thinking behind both of the last points is that H2O will scale beyond the memory of one machine, while R won't. It is also less confusing than trying to keep track of the same thing in two places.
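A minimal sketch of those two suggestions together, assuming the iris data sits in a CSV at a hypothetical path:

library(h2o)
h2o.init()
# Import straight into H2O instead of loading into R first.
iris_hex <- h2o.importFile("path/to/iris.csv")  # hypothetical path
# Let H2O split the data rather than caret::createDataPartition().
splits <- h2o.splitFrame(iris_hex, ratios = 0.8, seed = 1234)
train_h2o <- splits[[1]]
test_h2o  <- splits[[2]]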
I had the same issue but could resolve it quite easily.
My error occurred because I read in an h2o object before initialising the h2o cluster: I had trained an h2o model, saved it, shut down the cluster, then loaded the model back in and only afterwards initialized the cluster again.
Before reading in the h2o-object, you should already initialize the cluster (h2o.init()).
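In other words, a minimal sketch of the order that works (the model path is a placeholder):

library(h2o)
# 1. Initialize (or connect to) the cluster first.
h2o.init()
# 2. Only then read the saved model back into the cluster.
model <- h2o.loadModel("path/to/saved_model")  # hypothetical path
# 3. Now h2o.predict() can find the model on the cluster.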

Consistent results with multiple runs of h2o deeplearning

For a certain combination of parameters in the deeplearning function of h2o, I get different results each time I run it.
args <- list(list(hidden = c(200, 200, 200),
                  loss = "CrossEntropy",
                  hidden_dropout_ratio = c(0.1, 0.1, 0.1),
                  activation = "RectifierWithDropout",
                  epochs = EPOCHS))

run <- function(extra_params) {
  model <- do.call(h2o.deeplearning,
                   modifyList(list(x = columns, y = c("Response"),
                                   validation_frame = validation, distribution = "multinomial",
                                   l1 = 1e-5, balance_classes = TRUE,
                                   training_frame = training), extra_params))
}

model <- lapply(args, run)
What would I need to do in order to get consistent results for the model each time I run this?
Deep learning with H2O will not be reproducible if it is run on more than a single core. The results and performance metrics may vary slightly each time you train the deep learning model. The implementation in H2O uses a technique called "Hogwild!" which increases the speed of training at the cost of reproducibility on multiple cores.
So if you want reproducible results you will need to restrict H2O to run on a single core and make sure to use a seed in the h2o.deeplearning call.
Edit based on comment by Darren Cook:
I forgot to include the reproducible = TRUE parameter, which needs to be set in combination with the seed to make it truly reproducible. Note that this will make training a lot slower, and it is not advisable with a large dataset.
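Putting that together, a minimal sketch (the x/y/frame names are the placeholders from the question):

library(h2o)
h2o.init(nthreads = 1)  # restrict H2O to a single core

model <- h2o.deeplearning(x = columns, y = "Response",
                          training_frame = training,
                          hidden = c(200, 200, 200),
                          reproducible = TRUE,  # forces single-threaded training
                          seed = 1234)          # fixes the random number stream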
More information on "Hogwild!"

run h2o algorithms inside a foreach loop?

I naively thought it would be straightforward to make multiple calls to h2o.gbm in parallel inside a foreach loop, but got a strange error.
Error in { :
task 3 failed - "java.lang.AssertionError: Can't unlock: Not locked!"
Code below:
library(foreach)
library(doParallel)
library(doSNOW)

Xtr.hf = as.h2o(Xtr)
Xval.hf = as.h2o(Xval)

cl = makeCluster(6, type = "SOCK")
registerDoSNOW(cl)

junk <- foreach(i = 1:6,
                .packages = c("h2o"),
                .errorhandling = "stop",
                .verbose = TRUE) %dopar%
{
  h2o.init(ip = "localhost", nthreads = 2, max_mem_size = "5G")
  for (j in 1:3) {
    bm2 <- h2o.gbm(
      training_frame = Xtr.hf,
      validation_frame = Xval.hf,
      x = 2:ncol(Xtr.hf),
      y = 1,
      distribution = "gaussian",
      ntrees = 100,
      max_depth = 3,
      learn_rate = 0.1,
      nfolds = 1)
  }
  h2o.shutdown(prompt = FALSE)
  return(iname)
}
stopCluster(cl)
NOTE: This is unlikely to be a good use of R's parallel foreach, but I'll answer your question first, then explain why. (BTW, when I use "cluster" in this answer I'm referring to an H2O cluster (even if it is just on your local machine), and not an R "cluster".)
I've re-written your code, assuming the intention was to have a single H2O cluster, where all the models are to be made:
library(foreach)
library(doParallel)
library(doSNOW)
library(h2o)

h2o.init(ip = "localhost", nthreads = -1, max_mem_size = "5G")

Xtr.hf = as.h2o(Xtr)
Xval.hf = as.h2o(Xval)

cl = makeCluster(6, type = "SOCK")
registerDoSNOW(cl)

junk <- foreach(i = 1:6,
                .packages = c("h2o"),
                .errorhandling = "stop",
                .verbose = TRUE) %dopar%
{
  for (j in 1:3) {
    bm2 <- h2o.gbm(
      training_frame = Xtr.hf,
      validation_frame = Xval.hf,
      x = 2:ncol(Xtr.hf),
      y = 1,
      distribution = "gaussian",
      ntrees = 100,
      max_depth = 3,
      learn_rate = 0.1,
      nfolds = 1)
    #TODO: do something with bm2 here?
  }
  return(iname) #???
}
stopCluster(cl)
I.e. in outline form:
Start H2O, and load Xtr and Xval into it
Start 6 threads in your R client
In each thread, make 3 GBM models (one after the other)
I dropped the h2o.shutdown() command, guessing that you didn't intend that (when you shut down the H2O cluster, the models you just made get deleted). And I've highlighted where you might want to be doing something with your model. And I've given H2O all the threads on your machine (that is the nthreads = -1 in h2o.init()), not just 2.
You can make H2O models in parallel, but it is generally a bad idea, as they end up fighting for resources. Better to do them one at a time, and rely on H2O's own parallel code to spread the computation over the cluster. (When the cluster is a single machine this tends to be very efficient.)
The fact that you've gone to the trouble of making a parallel loop in R makes me think you've missed the way H2O works: it is a server written in Java, and R is just a light client that sends it API calls. The GBM calculations are not done in R; they are all done in Java code.
The other way to interpret your code is to run multiple instances of H2O, i.e. multiple H2O clusters. This might be a good idea if you have a set of machines, and you know the H2O algorithm is not scaling very well across a multi-node cluster. Doing it on a single machine is almost certainly a bad idea. But, for the sake of argument, this is how you do it (untested):
library(foreach)
library(doParallel)
library(doSNOW)

cl = makeCluster(6, type = "SOCK")
registerDoSNOW(cl)

junk <- foreach(i = 1:6,
                .packages = c("h2o"),
                .errorhandling = "stop",
                .verbose = TRUE) %dopar%
{
  library(h2o)
  h2o.init(ip = "localhost", port = 54321 + (i*2), nthreads = 2, max_mem_size = "5G")
  Xtr.hf = as.h2o(Xtr)
  Xval.hf = as.h2o(Xval)
  for (j in 1:3) {
    bm2 <- h2o.gbm(
      training_frame = Xtr.hf,
      validation_frame = Xval.hf,
      x = 2:ncol(Xtr.hf),
      y = 1,
      distribution = "gaussian",
      ntrees = 100,
      max_depth = 3,
      learn_rate = 0.1,
      nfolds = 1)
    #TODO: save bm2 here
  }
  h2o.shutdown(prompt = FALSE)
  return(iname) #???
}
stopCluster(cl)
Now the outline is:
Create 6 R threads
In each thread, start an H2O cluster that is running on localhost but on a port unique to that cluster. (The i*2 is because each H2O cluster is actually using two ports.)
Upload your data to the H2O cluster (i.e. this will be repeated 6 times, once for each cluster).
Make 3 GBM models, one after the other.
Do something with those models
Kill the cluster for the current thread.
If you have 12+ threads on your machine, and 30+ GB memory, and the data is relatively small, this will be roughly as efficient as using one H2O cluster and making 12 GBM models in serial. If not, I believe it will be worse. (But, if you have pre-started 6 H2O clusters on 6 remote machines, this might be a useful approach - I must admit I'd been wondering how to do this, and using the parallel library for it had never occurred to me until I saw your question!)
NOTE: as of the current version (3.10.0.6), I know the above code won't work, as there is a bug in h2o.init() that effectively means it is ignoring the port. (Workarounds: either pre-start all 6 H2O clusters on the commandline, or set the port in an environment variable.)
