Error in R h2o.predict with xgboost -> java.lang.NullPointerException

First of all thanks for implementing XGBoost in h2o!
Unfortunately, I am unable to predict from an h2o xgboost model that has been loaded from disk (which I'm sure you can appreciate is really frustrating).
I am using the latest stable release of h2o (3.10.5.2) with the R client.
I have included an example below that should enable you to reproduce the issue.
Thanks in advance
### Start h2o
require(h2o)
local_h2o = h2o.init()
### Source the base data set
data(mtcars)
h2o_mtcars = as.h2o(x = mtcars, destination_frame = 'h2o_mtcars')
### Fit a model to be saved
mdl_to_save = h2o.xgboost(model_id = 'mdl_to_save', y = 1, x = 2:11, training_frame = h2o_mtcars) ## This class doesn't work
#mdl_to_save = h2o.glm(model_id = 'mdl_to_save', y = 1, x = 2:11, training_frame = h2o_mtcars) ## This class works
### Take some reference predictions
ref_preds = h2o.predict(object = mdl_to_save, newdata = h2o_mtcars)
### Save the model to disk
silent = h2o.saveModel(object = mdl_to_save, path = 'INSERT_PATH', force = TRUE)
### Delete the model to make sure there can't be any strange locking issues
h2o.rm(ids = 'mdl_to_save')
### Load it back up
loaded_mdl = h2o.loadModel(path = 'INSERT_PATH/mdl_to_save')
### Score the model
### The h2o.predict statement below is what causes the error: java.lang.NullPointerException
lod_preds = h2o.predict(object = loaded_mdl, newdata = h2o_mtcars)
all.equal(ref_preds, lod_preds)

At the time of writing (January 2018), this is still a bug for xgboost. See this ticket for more information.
In the meantime, you can download the model as a POJO or MOJO file:
h2o.download_pojo(model, path = "/media/somewhere/tmp")
Loading the model back isn't that easy, unfortunately, but you can pass the new data to the saved POJO model with the function h2o.predict_json(). Note that the new data must be provided in JSON format.
See this question for more details
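A call might look roughly like this (an untested sketch: the POJO file name and the JSON fields below are illustrative, not from the question; check ?h2o.predict_json for the exact arguments):

```r
library(h2o)

# score one row of new data against the POJO downloaded above;
# the JSON keys must match the model's predictor column names
preds <- h2o.predict_json(
  model = "/media/somewhere/tmp/mdl_to_save.java",   # illustrative POJO path
  json  = '{"cyl": 6, "disp": 160, "hp": 110}'       # one row as JSON
)
```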

Related

Writing a prediction equation from plsr model

Greetings, everyone.
I successfully fitted a PLS regression model in R using the code below:
pls_modB_Kexch_2 <- plsr(Av.K_exc~., data = trainKexch.sar.veg, scale=TRUE,method= "s",validation='CV')
The regression coefficients for ncomp = 11 were:
(Intercept) = -4.692966e+05,
Easting = 6.068582e+03, Northings = 7.929767e+02,
sigma_vv = 8.024741e+05, sigma_vh = -6.375260e+05,
gamma_vv = -7.120684e+05, gamma_vh = 4.330279e+05,
beta_vv = -8.949598e+04, beta_vh = 2.045924e+05,
c11_db = 2.305016e+01, c22_db = -4.706773e+01,
c12_real = -1.877267e+00
It predicts new data sets well when applied within the R environment.
My challenge is presenting this model in the form of an equation y = B0 + sum(A*X), where the A are the coefficients of the respective variables X, or in any other mathematical form that can be presented academically.
I tried a direct approach, multiplying each coefficient by its variable and summing them up, but a quick manual trial of the predictions gave me strange results. Am I missing something here? Please help.
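One plausible culprit is scale = TRUE: plsr standardizes the predictors internally, so depending on how the coefficients were extracted they may apply to the scaled rather than the raw variables. An untested sketch, using the objects from the question, for checking the manual equation against the package's own predictions:

```r
library(pls)

# coefficients for the 11-component model, including the intercept
B  <- drop(coef(pls_modB_Kexch_2, ncomp = 11, intercept = TRUE))
b0 <- B[1]
A  <- B[-1]

# manual prediction y = b0 + sum(A * X) over the training predictors
X      <- as.matrix(trainKexch.sar.veg[, names(A)])
manual <- b0 + X %*% A

# the package's own predictions, for comparison; if these differ,
# the scaling is the first thing to check
auto <- predict(pls_modB_Kexch_2, ncomp = 11, newdata = trainKexch.sar.veg)
head(cbind(manual, auto))
```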

R targets with H2O

I use targets as a pipelining tool for an ML project with H2O.
The main peculiarity of using H2O here is that it creates a new "cluster" (basically a new local process/server which communicates via REST APIs, as far as I understand).
The issue I am having is two-fold:
How can I stop/operate the cluster within the targets framework in a smart way?
How can I save & load the data/models within the targets framework?
MWE
A minimum working example I came up with looks like this (being the _targets.R file):
library(targets)
library(h2o)

# start the h2o cluster once _targets.R gets evaluated
h2o.init(nthreads = 2, max_mem_size = "2G", port = 54322, name = "TESTCLUSTER")

create_dataset_h2o <- function() {
  # connect to the h2o cluster
  h2o.init(ip = "localhost", port = 54322, name = "TESTCLUSTER", startH2O = FALSE)
  # convert the data to an h2o dataframe
  as.h2o(iris)
}

train_model <- function(hex_data) {
  # connect to the h2o cluster
  h2o.init(ip = "localhost", port = 54322, name = "TESTCLUSTER", startH2O = FALSE)
  h2o.randomForest(x = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
                   y = c("Species"),
                   training_frame = hex_data,
                   model_id = "our.rf",
                   seed = 1234)
}

predict_model <- function(model, hex_data) {
  # connect to the h2o cluster
  h2o.init(ip = "localhost", port = 54322, name = "TESTCLUSTER", startH2O = FALSE)
  h2o.predict(model, newdata = hex_data)
}

list(
  tar_target(data, create_dataset_h2o()),
  tar_target(model, train_model(data), format = "qs"),
  tar_target(predict, predict_model(model, data), format = "qs")
)
This kind of works, but runs into the two issues I outlined above and describe below.
Ad 1 - stopping the cluster
Usually I would put an h2o::h2o.shutdown(prompt = FALSE) at the end of my script, but this does not work in this case.
Alternatively, I came up with a new target that is always run.
# in _targets.R in the final list
tar_target(END, h2o.shutdown(prompt = FALSE), cue = tar_cue(mode = "always"))
This works when I run tar_make() but not when I use tar_visnetwork().
Another option is to use:
# after the h2o.init(...) call inside _targets.R
on.exit(h2o.shutdown(prompt = FALSE), add = TRUE)
Another alternative that I came up with is to handle the server outside of targets and only connect to it. But I feel that this might break the targets workflow...
Do you have any other idea how to handle this?
Ad 2 - saving the dataset and model
The code in the MWE does not save the data for the targets model and predict in the correct format (format = "qs"). Sometimes (I think when the cluster gets restarted or so), the data gets "invalidated" and h2o throws an error. The data in h2o format in the R session is a pointer to the h2o dataframe (see also docs).
For keras, which similarly stores the models outside of R, there is the option format = "keras", which calls keras::save_model_hdf5() behind the scenes. Similarly, H2O would require h2o::h2o.exportFile() and h2o::h2o.importFile() for the dataset and h2o::h2o.saveModel() and h2o::h2o.loadModel() for models (see also docs).
Is there a way to create additional formats for tar_target(), or do I need to write the data to a file and return the file path? The downside to this is that the file would sit outside the _targets folder system, if I am not mistaken.
Ad 1
I would recommend handling the H2O cluster outside the pipeline in a separate script. That way, tar_visnetwork() would not start or stop the cluster, and you could more cleanly separate the software engineering from the data analysis.
# run_pipeline.R
start_h2o_cluster(port = ...)
on.exit(stop_h2o_cluster(port = ...))
targets::tar_make_clustermq(workers = 4)
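Fleshed out with the h2o calls from the question (a sketch; start_h2o_cluster()/stop_h2o_cluster() above are placeholders I have not implemented), the script could look like:

```r
# run_pipeline.R -- manage the cluster outside the pipeline
library(h2o)

h2o.init(nthreads = 2, max_mem_size = "2G", port = 54322, name = "TESTCLUSTER")

tryCatch(
  targets::tar_make(),                      # or tar_make_clustermq(workers = 4)
  finally = h2o.shutdown(prompt = FALSE)    # always stop the cluster, even on error
)
```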
Ad 2
It sounds like H2O objects are not exportable. Currently, you would need to save those files manually, identify the paths, and write format = "file" in tar_target(). I am willing to consider H2O-based formats. Are all objects in some way covered by h2o.exportFile(), h2o.importFile(), h2o::h2o.saveModel(), and h2o::h2o.loadModel(), or are there more kinds of objects with different serialization functions? And does h2o have utilities to perform this (un)serialization in memory, like serialize_model()/unserialize_model() in keras?
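As a stopgap, a format = "file" target could look like this (a sketch; the helper name and directory are my own, not from targets or h2o):

```r
# save the trained model to disk and return the path,
# so that targets can track the file via format = "file"
save_model_file <- function(model) {
  h2o.saveModel(model, path = "h2o_models", force = TRUE)  # returns the saved path
}

# in the _targets.R list:
# tar_target(model_path, save_model_file(train_model(data)), format = "file"),
# tar_target(predict, predict_model(h2o.loadModel(model_path), data))
```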

R package to download CMIP6 data

I want to download CMIP6 data from here. The package 'epwshiftr' has a nice function, 'init_cmip6_index', to index all the relevant information, including the download URLs.
library("epwshiftr")
# this indexes all the information about the models
test = init_cmip6_index(activity = "CMIP",
                        variable = 'pr',
                        frequency = 'mon',
                        experiment = c("historical"),
                        source = NULL,
                        variant = NULL,
                        replica = F,
                        latest = T,
                        limit = 10000L,
                        data_node = NULL)
# to print the unique models
unique(test$source_id)
This gives me a list of 18 models.
There are many other models (for instance 'HAMMOZ-Consortium.MPI-ESM-1-2-HAM' and 'IPSL.IPSL-CM6A-LR') that have data matching my specification, but I do not see them in the query results.
Is there a way to make this package work (or update the package) to include all the files which are available for download?

How to code: if loading of a file/model is possible, skip running it again

How can I complement the code below so that, if loading the saved file/model succeeds, running and saving the model again is skipped?
load("model.RData")
model = brm(bf(pt_overall_outpatient ~ predictor), data = data)
save(model, file = "model.RData")
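One common pattern is to fit only when the saved file cannot be found. A sketch (the helper is my own; the brm() call in the usage comment is the question's, with its data assumed to exist):

```r
# fit_or_load(): load `model` from `path` if the file exists,
# otherwise run `fit_fun()`, save the result, and return it
fit_or_load <- function(path, fit_fun) {
  if (file.exists(path)) {
    load(path)  # restores the object named `model` into this function's environment
  } else {
    model <- fit_fun()
    save(model, file = path)
  }
  model
}

# usage with the question's brms model:
# model <- fit_or_load("model.RData",
#                      function() brm(bf(pt_overall_outpatient ~ predictor), data = data))
```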

Error in impute() in R

I'm learning random forests. For learning purposes I'm using the following link: random Forest. I'm trying to run the code given in this link using R 3.4.1.
But while running the following code for missing-value treatment
mp2 <- impute(data = test, target = "target",
              classes = list(integer = imputeMedian(), factor = imputeMode()))
I'm getting the error message:
Error in impute(data = test, target = "target", classes = list(integer = imputeMedian(), :
  unused argument (data = test)
I modified the code and tried running this:
imp2 <- impute(test, target = "target", classes = list(integer = imputeMedian(), factor = imputeMode()))
I'm still getting an error, but the message is different. Can you please help me solve this issue?
The key mistake (among many mistakes) in that code was that there is no data parameter; the parameter name is obj. When I change that, the example code runs.
You also need to set on= or use setkey(), given that the object is a data.table, or simply change it to a data.frame for the imputation step:
imp1 <- impute(obj = as.data.frame(train), target = "target", classes = list(integer = imputeMedian(), factor = imputeMode()))
