H2O: Cannot read LARGE model from disk via `h2o.loadModel` - r

UPDATED 28Jun2017, below, in response to @Michal Kurka.
UPDATED 26Jun2017, below.
I am unable to load a large GBM model that I saved in H2O's native format (i.e., hex).
H2O v3.10.5.1
R v3.3.2
Linux 3.10.0-327.el7.x86_64 GNU/Linux
My goal is to eventually save this model as MOJO.
This model was so large that I had to initialize H2O with min/max memory 100G/200G before H2O's model training would run successfully.
This is how I trained the GBM model:
localH2O <- h2o.init(ip = 'localhost', port = port, nthreads = -1,
min_mem_size = '100G', max_mem_size = '200G')
iret <- h2o.gbm(x = predictors, y = response, training_frame = train.hex,
validation_frame = holdout.hex, distribution="multinomial",
ntrees = 3000, learn_rate = 0.01, max_depth = 5, nbins = numCats,
model_id = basename_model)
gbm <- h2o.getModel(basename_model)
oPath <- h2o.saveModel(gbm, path = './', force = TRUE)
The training data contains 81,886 records with 1413 columns. Of these columns, 19 are factors. The vast majority of these columns are 0/1.
$ wc -l training/*.txt
81887 training/train.txt
27294 training/holdout.txt
This is the saved model as written to disk:
$ ls -l
total 37G
-rw-rw-r-- 1 bfo7328 37G Jun 22 19:57 my_model.hex
This is how I tried to read the model from disk using the same large memory allocation values 100G/200G:
$ R
R version 3.3.2 (2016-10-31) -- "Sincere Pumpkin Patch"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)
> library(h2o)
> localH2O=h2o.init(ip='localhost', port=65432, nthreads=-1,
min_mem_size='100G', max_mem_size='200G')
H2O is not running yet, starting it now...
Note: In case of errors look at the following log files:
/tmp/RtmpVSwxXR/h2o_bfo7328_started_from_r.out
/tmp/RtmpVSwxXR/h2o_bfo7328_started_from_r.err
openjdk version "1.8.0_121"
OpenJDK Runtime Environment (build 1.8.0_121-b13)
OpenJDK 64-Bit Server VM (build 25.121-b13, mixed mode)
Starting H2O JVM and connecting: .. Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 3 seconds 550 milliseconds
H2O cluster version: 3.10.5.1
H2O cluster version age: 13 days
H2O cluster name: H2O_started_from_R_bfo7328_kmt050
H2O cluster total nodes: 1
H2O cluster total memory: 177.78 GB
H2O cluster total cores: 64
H2O cluster allowed cores: 64
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 65432
H2O Connection proxy: NA
H2O Internal Security: FALSE
R Version: R version 3.3.2 (2016-10-31)
From /tmp/RtmpVSwxXR/h2o_bfo7328_started_from_r.out:
INFO: Processed H2O arguments: [-name, H2O_started_from_R_bfo7328_kmt050, -ip, localhost, -port, 65432, -ice_root, /tmp/RtmpVSwxXR]
INFO: Java availableProcessors: 64
INFO: Java heap totalMemory: 95.83 GB
INFO: Java heap maxMemory: 177.78 GB
INFO: Java version: Java 1.8.0_121 (from Oracle Corporation)
INFO: JVM launch parameters: [-Xms100G, -Xmx200G, -ea]
INFO: OS version: Linux 3.10.0-327.el7.x86_64 (amd64)
INFO: Machine physical memory: 1.476 TB
My call to h2o.loadModel:
if ( TRUE ) {
now <- format(Sys.time(), "%a %b %d %Y %X")
cat( sprintf( 'Begin %s\n', now ))
model_filename <- './my_model.hex'
in_model.hex <- h2o.loadModel( model_filename )
now <- format(Sys.time(), "%a %b %d %Y %X")
cat( sprintf( 'End %s\n', now ))
}
From /tmp/RtmpVSwxXR/h2o_bfo7328_started_from_r.out:
INFO: GET /, parms: {}
INFO: GET /, parms: {}
INFO: GET /, parms: {}
INFO: GET /3/InitID, parms: {}
INFO: Locking cloud to new members, because water.api.schemas3.InitIDV3
INFO: POST /99/Models.bin/, parms: {dir=./my_model.hex}
After waiting an hour, I see these "out of memory" (OOM) error messages:
INFO: POST /99/Models.bin/, parms: {dir=./my_model.hex}
#e Thread WARN: Swapping! GC CALLBACK, (K/V:24.86 GB + POJO:112.01 GB + FREE:40.90 GB == MEM_MAX:177.78 GB), desiredKV=22.22 GB OOM!
#e Thread WARN: Swapping! GC CALLBACK, (K/V:26.31 GB + POJO:118.41 GB + FREE:33.06 GB == MEM_MAX:177.78 GB), desiredKV=22.22 GB OOM!
#e Thread WARN: Swapping! GC CALLBACK, (K/V:27.36 GB + POJO:123.03 GB + FREE:27.39 GB == MEM_MAX:177.78 GB), desiredKV=22.22 GB OOM!
#e Thread WARN: Swapping! GC CALLBACK, (K/V:28.21 GB + POJO:126.73 GB + FREE:22.83 GB == MEM_MAX:177.78 GB), desiredKV=22.22 GB OOM!
I would not expect to need so much memory to read the model from disk.
How can I read this model from disk into memory? And once I do, can I save it as a MOJO?
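For the MOJO part of the question: assuming the model can be loaded at all, the export might look like the sketch below. h2o.download_mojo() is the relevant call in the R API; check that it exists in your H2O version (?h2o.download_mojo) before relying on it.

```r
library(h2o)
h2o.init()

# Load the saved binary model, then export it as a MOJO zip plus the
# h2o-genmodel.jar needed to score it outside the cluster.
gbm <- h2o.loadModel("./my_model.hex")
mojo_path <- h2o.download_mojo(gbm, path = "./", get_genmodel_jar = TRUE)
```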
UPDATE 1: 26Jun2017
I just noticed that the disk size of a GBM model increased dramatically between versions of H2O:
H2O v3.10.2.1:
-rw-rw-r-- 1 169M Jun 19 07:23 my_model.hex
H2O v3.10.5.1:
-rw-rw-r-- 1 37G Jun 22 19:57 my_model.hex
Any ideas why? Could this be the root of the problem?
UPDATE 2: 28Jun2017 in response to comments by @Michal Kurka.
When I load the training data via fread, the class (type) of each column is correct:
* 24 columns are ‘character’;
* 1389 columns are ‘integer’ (all but one column are 0/1);
* 1413 total columns.
I then convert the R-native data frame to an H2O data frame and manually factor-ize 20 columns:
train.hex <- as.h2o(df.train, destination_frame = "train.hex")
length(factorThese)
[1] 20
train.hex[factorThese] <- as.factor(train.hex[factorThese])
str(train.hex)
A condensed version of the output from str(train.hex), showing only those 19 columns that are factors (1 factor is the response column):
- attr(*, "nrow")= int 81886
- attr(*, "ncol")= int 1413
- attr(*, "types")=List of 1413
..$ : chr "enum" : Factor w/ 72 levels
..$ : chr "enum" : Factor w/ 77 levels
..$ : chr "enum" : Factor w/ 51 levels
..$ : chr "enum" : Factor w/ 4226 levels
..$ : chr "enum" : Factor w/ 4183 levels
..$ : chr "enum" : Factor w/ 3854 levels
..$ : chr "enum" : Factor w/ 3194 levels
..$ : chr "enum" : Factor w/ 735 levels
..$ : chr "enum" : Factor w/ 133 levels
..$ : chr "enum" : Factor w/ 16 levels
..$ : chr "enum" : Factor w/ 25 levels
..$ : chr "enum" : Factor w/ 647 levels
..$ : chr "enum" : Factor w/ 715 levels
..$ : chr "enum" : Factor w/ 679 levels
..$ : chr "enum" : Factor w/ 477 levels
..$ : chr "enum" : Factor w/ 645 levels
..$ : chr "enum" : Factor w/ 719 levels
..$ : chr "enum" : Factor w/ 678 levels
..$ : chr "enum" : Factor w/ 478 levels
The above results are exactly the same between v3.10.2.1 (smaller model written to disk: 169M) and v3.10.5.1 (larger model written to disk: 37G).
The actual GBM training uses nbins <- 37:
numCats <- n_distinct(as.matrix(dplyr::select_(df.train,response)))
numCats
[1] 37
iret <- h2o.gbm(x = predictors, y = response, training_frame = train.hex,
validation_frame = holdout.hex, distribution="multinomial",
ntrees = 3000, learn_rate = 0.01, max_depth = 5, nbins = numCats,
model_id = basename_model)

The difference in size of the models (169M vs 37G) is surprising. Can you please make sure that H2O recognizes all your numeric columns as numeric and not categorical with very high cardinality?
Do you use automatic detection of column types or do you specify it manually?
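If the 0/1 indicator columns were auto-detected as categorical, every tree would carry large enum domains, which could explain the 169M vs 37G jump. A minimal sketch of forcing column types at import time follows; the factorIdx positions are placeholders, not taken from the actual data:

```r
library(h2o)
h2o.init()

# Placeholder: positions of the ~20 genuinely categorical columns.
factorIdx <- 1:20

# Force everything else to numeric so 0/1 indicator columns can never
# be auto-detected as high-cardinality enums.
col_types <- rep("numeric", 1413)
col_types[factorIdx] <- "enum"

train.hex <- h2o.importFile("training/train.txt", col.types = col_types)
str(train.hex)   # verify: only the intended columns report type "enum"
```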

Related

Error in match.arg(opt_crit) : 'arg' must be NULL or a character vector

The error "Error in match.arg(opt_crit) : 'arg' must be NULL or a character vector" occurs when I try to run my script in R.
I have tried to find a solution, but the error seems quite specific, and I have found little help.
My dataset contains 3936 obs of 7 variables.
environment, skill, volume, datetime, year, month, day
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 3696 obs. of 7 variables:
$ environment: chr "b2b" "b2b" "b2b" "b2b" ...
$ skill : chr "BO Bedrift" "BO Bedrift" "BO Bedrift" "BO Bedrift" ...
$ year : num 2017 2017 2017 2017 2017 ...
$ month : num 1 1 1 1 1 2 2 2 2 3 ...
$ day : num 2 9 16 23 30 6 13 20 27 6 ...
$ volume : num 360 312 305 222 113 ...
$ datetime : Date, format: "2017-01-02" "2017-01-09" "2017-01-16" "2017-01-23" ...
but when trying to run
volume_ets <- volume_tsbl %>% ETS(volume)
this message shows in the console
Error in match.arg(opt_crit) : 'arg' must be NULL or a character vector
I tried a workaround, but it did not help:
volume_tsbl$volume <- as.numeric(as.character(volume_tsbl$volume))
volume_ets <- volume_tsbl %>% ETS(volume)
My tsibble is created like this:
volume_tsbl <- volume %>% as_tsibble(key = c(skill, environment), index = datetime, regular = TRUE)
Expected the code to run, but it does not.
This is the result of an interface change made in late 2018. The change was to make model functions (such as ETS()) create model definitions, rather than fitted models. Essentially, ETS() no longer accepts data as an input, and the specification for the ETS model would become ETS(volume).
The equivalent code in the current version of fable is:
volume_ets <- volume_tsbl %>% model(ETS(volume))
The model() function is used to train one or more model definitions (ETS(volume) in this case) on a given dataset.
You can refer to the pkgdown site for fable to see more details: http://fable.tidyverts.org/
In particular, the ETS() function is documented here: http://fable.tidyverts.org/reference/ETS.html
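Put together, the current fable workflow might look like the sketch below. The toy series stands in for the asker's volume_tsbl (same key and index structure); the forecast horizon is an arbitrary illustration.

```r
library(dplyr)
library(tsibble)
library(fable)

# Toy weekly series standing in for the asker's volume_tsbl
# (key = skill x environment, index = datetime)
volume_tsbl <- tibble(
  skill       = "BO Bedrift",
  environment = "b2b",
  datetime    = as.Date("2017-01-02") + 7 * (0:51),
  volume      = rnorm(52, mean = 300, sd = 30)
) %>%
  as_tsibble(key = c(skill, environment), index = datetime)

# model() trains one model definition per key
volume_ets <- volume_tsbl %>% model(ets = ETS(volume))
forecast(volume_ets, h = "8 weeks")
```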

Plotting time (HMS) with ggplot2

I'm trying to plot running sessions. I want to make a ggplot with:
x = distance (2.2km, 5km, 10km, 12.8km, Ziel)
y = time (HMS)
I have the following data:
'data.frame': 16333 obs. of 6 variables:
$ Numéro : chr "6526" "5427" "6528" "6529" ...
$ X2.2km : chr "00:10:47.4" "00:08:58.2" "00:11:10.4" "00:09:27.3" ...
$ X5km : chr "00:26:05.0" "00:21:46.1" "00:27:13.5" "00:22:35.3" ...
$ X10km : chr "00:56:30.1" "00:45:59.3" "00:58:53.1" "00:47:51.7" ...
$ X12.8km : chr "01:14:24.7" "00:59:50.7" "01:17:35.0" "01:01:42.6" ...
$ Zielzeit: chr "01:37:40.0" "01:16:38.1" "01:41:53.0" "01:19:02.5" ...
The next step is to use the melt function from reshape2, and then lubridate:
xx<-melt(xx,id="Numéro")
####Using lubridate ####
xx$value<-hms(xx$value)
My problem: when I try to plot simple graphics, I receive the following message:
> ggplot(xx,aes(variable,value))+geom_point()
Error in x < range[1] : cannot compare Period to Duration:
coerce with 'as.numeric' first.
> ggplot(xx,aes(variable,value))+geom_line()
Error in x < range[1] : cannot compare Period to Duration:
coerce with 'as.numeric' first.
DATASET
xx <- read.table(header=TRUE, text="
Numéro variable value
1 6526 X2.2km 10M 47.4S
2 5427 X2.2km 8M 58.2S
3 6528 X2.2km 11M 10.4S
4 6529 X2.2km 9M 27.3S
5 6530 X2.2km 8M 29.3S")
Thanks for any contributions.
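The error message itself points at the fix: lubridate's hms() returns a Period, which ggplot2 cannot place on a continuous scale. One hedged sketch is to convert each time to plain numeric seconds before plotting; the helper below does the conversion in base R, assuming the values are still the original "HH:MM:SS.s" strings (the posted DATASET shows the already-parsed Period form).

```r
library(ggplot2)

# Convert "HH:MM:SS.s" strings to numeric seconds in base R,
# sidestepping the Period vs Duration comparison entirely.
hms_to_sec <- function(x) {
  vapply(strsplit(x, ":", fixed = TRUE),
         function(v) sum(as.numeric(v) * c(3600, 60, 1)),
         numeric(1))
}

# Tiny stand-in for the melted data frame (values as in the original file)
xx <- data.frame(
  Numero   = c("6526", "5427", "6526", "5427"),
  variable = c("X2.2km", "X2.2km", "X5km", "X5km"),
  value    = c("00:10:47.4", "00:08:58.2", "00:26:05.0", "00:21:46.1")
)
xx$seconds <- hms_to_sec(xx$value)

ggplot(xx, aes(variable, seconds, group = Numero)) + geom_line()
```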

R plot spectrogram base on the amplitude data of a wave

In R, to plot the spectrogram of a wave, I do the following:
>library(sound)
>library(tuneR)
>library(seewave)
>s1<-readWave('sample1.wav')
>spectro(s1,main='s1')
>str(s1)
Formal class 'Wave' [package "tuneR"] with 6 slots
..@ left : int [1:312000] 2293 2196 1964 1640 1461 1285 996 600 138 -195 ...
..@ right : num(0)
..@ stereo : logi FALSE
..@ samp.rate: int 8000
..@ bit : int 16
..@ pcm : logi TRUE
But what if I just have data.txt as
2293 2196 1964 1640 1461 1285 996 600 138 -195 ...
What should I pass to the spectro function? Its signature is spectro(wave, f, ...), and wave is said to be an R object. Or should I use something else to get the plot? I tried
>s_1<-read.table("s_1.txt", sep=" ")
>spectro(s_1,f=8000)
Error in filled.contour.modif2(x = X, y = Y, z = Z, levels = collevels, :
no proper 'z' matrix specified
and it ended with the error above. Thank you.
I agree the documentation is a little hazy.
What I believe it is trying to say is that the first argument must be a Wave object. You can convert a numeric vector into a Wave object using the tuneR Wave() function.
v <- runif(5000, -2^15, 2^15-1)
v.wav <- Wave(v, samp.rate=8000, bit=16)
spectro(v.wav)
I didn't manage to install seewave on my current computer, so I tested this on an old computer with software a couple of years old. I can't guarantee that this example will work.
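Applied to the asker's plain-text amplitudes, the same approach might look like this, assuming data.txt holds one whitespace-separated integer series sampled at 8000 Hz as in the original wave:

```r
library(tuneR)
library(seewave)

# Read the raw amplitude values, wrap them in a Wave object with the
# known sample rate and bit depth, then plot the spectrogram.
amp <- scan("data.txt")
w   <- Wave(round(amp), samp.rate = 8000, bit = 16)
spectro(w, f = 8000)
```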

Error using roll_lm in a debian machine

The following piece of simple code works perfectly on my local windows machine
require(roll)
x = matrix(rnorm(100),100,1)
y = matrix(rnorm(100),100,1)
roll_lm(x,y,10)
However, on a debian distant machine, it crashes with this error message:
caught illegal operation
address 0x7f867a59ee04,
cause 'illegal operand'
Traceback:
1: .Call("roll_roll_lm", PACKAGE = "roll", x, y, as.integer(width), as.numeric(weights), as.logical(center_x), as.logical(center_y), as.logical(scale_x), as.logical(scale_y), as.integer(min_obs), as.logical(complete_obs), as.logical(na_restore), as.character(match.arg(parallel_for)))
2: roll_lm(x, y, 10)
Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Option 1 : abort (with core dump, if enabled) gives:
Illegal instruction
I am clueless on how to interpret this message.
Any help? Thanks.
Some info :
R.version _
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 3
minor 2.5
year 2016
month 04
day 14
svn rev 70478
language R
version.string R version 3.2.5 (2016-04-14)
nickname Very, Very Secure Dishes
The system:
Linux machineName 3.2.0-4-amd64 #1 SMP Debian 3.2.63-2+deb7u1 x86_64 GNU/Linux
Works for me:
R> require(roll)
R> x = matrix(rnorm(100),100,1)
R> y = matrix(rnorm(100),100,1)
R> str(roll_lm(x,y,10))
List of 2
$ coefficients: num [1:100, 1:2] NA NA NA NA NA ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr [1:2] "(Intercept)" "x1"
$ r.squared : num [1:100, 1] NA NA NA NA NA ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "R-squared"
R>
I suggest you rebuild and reinstall the roll package.
Sometimes this happens when one component (Rcpp, RcppParallel, ...) gets updated.
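A hedged recipe for that rebuild: reinstalling from source recompiles roll's C++ code (and its compiled dependencies) against the Debian machine's own toolchain and CPU, which is usually what an "illegal operand" crash calls for:

```r
# Reinstall the compiled dependencies and roll itself from source so the
# binaries match this machine's CPU and instruction set.
install.packages(c("Rcpp", "RcppParallel", "roll"), type = "source")
```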

Can't run r package mixOmics functions with large matrix

The mixOmics package is meant to analyze big data sets (e.g. from high-throughput experiments), but it does not seem to work with my big matrix.
I am having issues with both rcc (regularized canonical correlation) and tune.rcc (lambda parameter estimation for regularized canonical correlation).
> str(Y)
num [1:13, 1:17766] ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:13] ...
..$ : chr [1:17766] ...
> str(X)
num [1:13, 1:26] ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:13] ...
..$ : chr [1:26] ...
tune.rcc(X, Y, grid1 = seq(0.001, 1, length = 5),
grid2 = seq(0.001, 1, length = 5),
validation = "loo", plt=F)
On Mavericks: runs forever (I quit R after hours)
Since I know Mavericks can be problematic, I tried it on a Windows 8 machine and on the mixOmics web interface.
On Windows 8:
Error: cannot allocate vector of size 2.4 Gb
On the web interface, since it is not possible to estimate the lambdas (tune.rcc), I tried rcc with "some" lambdas and got:
Error: cannot allocate vector of size 2.4 Gb
Am I doing something obviously wrong?
Any help very much appreciated.
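The 2.4 Gb figure is consistent with a single dense q x q matrix being materialized for the 17766-column Y block. A quick back-of-the-envelope check (my arithmetic, not taken from the mixOmics source):

```r
# One double-precision 17766 x 17766 matrix:
q     <- 17766
bytes <- q^2 * 8
bytes / 1024^3   # ~2.35 GiB, which R rounds and reports as "2.4 Gb"
```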
