'Experiment DEAD got' warnings in H2O DAI log - driverless-ai

I see lots of 'Experiment DEAD got' warnings in the H2O DAI experiment log. What does this mean?
2019-10-10 10:37:11,441 C: D:109.4GB M:6.0GB 15175 WARNING: Experiment DEAD got 426a3c9c-eafe-11e9-8651-9cb6d060f074 server process (pids were: []) [=#h2oai-426a3c9c].

Related

Data unpack would read past end of buffer in file util/show_help.c at line 501

I submitted a job via Slurm. The job ran for 12 hours and was working as expected. Then I got "Data unpack would read past end of buffer in file util/show_help.c at line 501". I often get errors like "ORTE has lost communication with a remote daemon", but usually at the beginning of the job. That is annoying, but it does not cost as much time as an error after 12 hours. Is there a quick fix for this? The Open MPI version is 4.0.1.
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.
Local host: barbun40
Local adapter: mlx5_0
Local port: 1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: barbun40
Local device: mlx5_0
--------------------------------------------------------------------------
[barbun21.yonetim:48390] [[15284,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in
file util/show_help.c at line 501
[barbun21.yonetim:48390] 127 more processes have sent help message help-mpi-btl-openib.txt / ib port
not selected
[barbun21.yonetim:48390] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error
messages
[barbun21.yonetim:48390] 126 more processes have sent help message help-mpi-btl-openib.txt / error in
device init
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected. This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).
Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate. For
example, there may be a core file that you can examine. More
generally: such peer hangups are frequently caused by application bugs
or other external events.
Local host: barbun64
Local PID: 252415
Peer host: barbun39
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[15284,1],35]
Exit code: 9
--------------------------------------------------------------------------
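The help text at the top names its own override: with Open MPI >= 4.0 the openib BTL is disabled for InfiniBand ports in favour of UCX, and the btl_openib_allow_ib MCA parameter re-enables it. A sketch of both ways to set it follows; the application name and process count in the commented mpirun line are placeholders. Note that the later messages (peer disconnect, exit code 9) point at a crashed peer process, so silencing these warnings may not cure the underlying failure.

```shell
# Any MCA parameter can also be set through an OMPI_MCA_-prefixed
# environment variable; this is equivalent to passing --mca on the
# mpirun command line.
export OMPI_MCA_btl_openib_allow_ib=true
echo "btl_openib_allow_ib=${OMPI_MCA_btl_openib_allow_ib}"   # prints btl_openib_allow_ib=true

# equivalent command-line form (./my_app and -np 128 are placeholders):
# mpirun --mca btl_openib_allow_ib true -np 128 ./my_app
```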

Airflow simple tasks failing without logs with small parallelism LocalExecutor (was working with SequentialExecutor)

Running an Airflow (v1.10.5) DAG that ran fine with the SequentialExecutor, I now have many (though not all) simple tasks failing without any log information when running with the LocalExecutor and minimal parallelism, e.g.
<airflow.cfg>
# overall task concurrency limit for airflow
parallelism = 8 # which is same as number of cores shown by lscpu
# max tasks per dag
dag_concurrency = 2
# max instances of a given dag that can run on airflow
max_active_runs_per_dag = 1
# max threads used per worker / core
max_threads = 2
# 40G of RAM available total
# CPUs: 8 (sockets 4, cores per socket 4)
see https://www.astronomer.io/guides/airflow-scaling-workers/
Looking at the airflow-webserver.* logs nothing looks out of the ordinary, but looking at airflow-scheduler.out I see...
[airflow@airflowetl airflow]$ tail -n 20 airflow-scheduler.out
....
[2019-12-18 11:29:17,773] {scheduler_job.py:1283} INFO - Executor reports execution of mydag.task_level1_table1 execution_date=2019-12-18 21:21:48.424900+00:00 exited with status failed for try_number 1
[2019-12-18 11:29:17,779] {scheduler_job.py:1283} INFO - Executor reports execution of mydag.task_level1_table2 execution_date=2019-12-18 21:21:48.424900+00:00 exited with status failed for try_number 1
[2019-12-18 11:29:17,782] {scheduler_job.py:1283} INFO - Executor reports execution of mydag.task_level1_table3 execution_date=2019-12-18 21:21:48.424900+00:00 exited with status failed for try_number 1
[2019-12-18 11:29:18,833] {scheduler_job.py:832} WARNING - Set 1 task instances to state=None as their associated DagRun was not in RUNNING state
[2019-12-18 11:29:18,844] {scheduler_job.py:1283} INFO - Executor reports execution of mydag.task_level1_table4 execution_date=2019-12-18 21:21:48.424900+00:00 exited with status success for try_number 1
....
but not really sure what to take away from this.
Anyone know what could be going on here or how to get more helpful debugging info?
Looking again at my lscpu specs, I noticed...
[airflow@airflowetl airflow]$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 2
Notice Thread(s) per core: 1
Looking at my airflow.cfg settings I see max_threads = 2. Setting max_threads = 1 and restarting both the scheduler and the webserver seems to have fixed the problem.
If anyone knows more about what exactly is going wrong under the hood (eg. why the task fails rather than just waiting for another thread to become available), would be interested to hear about it.
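For reference, a sketch of the airflow.cfg fragment with that fix applied (section placement per the Airflow 1.10.x defaults: parallelism, dag_concurrency and max_active_runs_per_dag live under [core], max_threads under [scheduler]; adjust to your own file):

```
[core]
# overall task concurrency limit for airflow
parallelism = 8
# max tasks per dag
dag_concurrency = 2
# max instances of a given dag that can run on airflow
max_active_runs_per_dag = 1

[scheduler]
# keep at or below Thread(s) per core reported by lscpu
max_threads = 1
```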

How to stop h2o from saving massive .ERR, .OUT and other log files to the local drive

I am currently running a script in which several h2o glm and deeplearning models are being generated for several iterations of a Monte-Carlo Cross-Validation. When finished running (which takes about half a day), h2o is saving immense files to the local drive (with sizes up to 8.5 GB). These files are not erased when RStudio or my computer is restarted (as I originally thought). Is there a way to stop h2o from saving these files?
When you start H2O with h2o.init() from R, the stdout and stderr files should be saved to a temporary directory (see R's tempdir() for the path). This temporary directory should be removed when the R session exits. It seems as though this is not working with RStudio; however, it works if you are using R from the command line. I'm not sure whether this is a setting that can be changed in RStudio or an RStudio bug.
But you can take more control yourself. You can start H2O by hand using java on the command line and then connect from R using h2o.init().
java -Xmx5g -jar h2o.jar
In this example, I started H2O with 5 GB of Java heap memory, but you should increase that if your data is larger. Then connecting in R will look like this:
> h2o.init()
Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 16 hours 34 minutes
H2O cluster version: 3.15.0.99999
H2O cluster version age: 17 hours and 25 minutes
H2O cluster name: H2O_started_from_R_me_exn817
H2O cluster total nodes: 1
H2O cluster total memory: 4.43 GB
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
H2O API Extensions: XGBoost, Algos, AutoML, Core V3, Core V4
R Version: R version 3.3.2 (2016-10-31)
So if you want to discard both stdout and stderr, simply append the redirection to the java command that starts the H2O cluster, and then connect from R again. To redirect both streams to /dev/null, append > /dev/null 2>&1 like this:
java -Xmx5g -jar h2o.jar > /dev/null 2>&1 &
I encountered this in a Spark shell running H2O. The shell had 50 executors connected, and this eventually caused storage issues in the /tmp directories on those nodes.
When h2o.init() is called it creates JVMs, and the logging from H2O is handled by those JVMs. But when the shell is shut down, those JVMs persist and keep logging heartbeat errors to /tmp in perpetuity. You will need to find the JVMs associated with H2O and shut them down. I believe in my case the specific process name was water.H2OApp.
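A hedged sketch of that cleanup, assuming the leftover JVMs show water.H2OApp in their command line as observed above:

```shell
# List leftover H2O JVMs by process name, then terminate them.
# "|| true" keeps the script going when no such process exists.
pgrep -af water.H2OApp || true   # show PIDs and command lines
pkill -f water.H2OApp || true    # send SIGTERM to each match
```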
I found it easier to take care of the problem by removing those files after running every model.
unlink(list.files(tempdir(), full.names = TRUE), recursive = TRUE)
This helps remove the temporary files when I run the multiple models in a loop.

Read a large (1.5 GB) file in h2o R

I am using the h2o package for modelling in R. For this I want to read a dataset which has a size of about 1.5 GB using h2o.importFile(). I start the H2O server using the lines
library(h2oEnsemble)
h2o.init(max_mem_size = '1499m',nthreads=-1)
This produces a log
H2O is not running yet, starting it now...
java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) Client VM (build 25.121-b13, mixed mode)
Starting H2O JVM and connecting: . Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 3 seconds 665 milliseconds
H2O cluster version: 3.10.4.8
H2O cluster version age: 28 days, 14 hours and 36 minutes
H2O cluster name: H2O_started_from_R_Lucifer_jvn970
H2O cluster total nodes: 1
H2O cluster total memory: 1.41 GB
H2O cluster total cores: 4
H2O cluster allowed cores: 4
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
R Version: R version 3.3.2 (2016-10-31)
The following line gives me an error
train=h2o.importFile(path=normalizePath("C:\\Users\\All data\\traindt.rds"))
DistributedException from localhost/127.0.0.1:54321, caused by java.lang.AssertionError
DistributedException from localhost/127.0.0.1:54321, caused by java.lang.AssertionError
at water.MRTask.getResult(MRTask.java:478)
at water.MRTask.getResult(MRTask.java:486)
at water.MRTask.doAll(MRTask.java:402)
at water.parser.ParseDataset.parseAllKeys(ParseDataset.java:246)
at water.parser.ParseDataset.access$000(ParseDataset.java:27)
at water.parser.ParseDataset$ParserFJTask.compute2(ParseDataset.java:195)
at water.H2O$H2OCountedCompleter.compute(H2O.java:1315)
at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
Caused by: java.lang.AssertionError
at water.parser.Categorical.addKey(Categorical.java:41)
at water.parser.FVecParseWriter.addStrCol(FVecParseWriter.java:127)
at water.parser.CsvParser.parseChunk(CsvParser.java:133)
at water.parser.Parser.readOneFile(Parser.java:187)
at water.parser.Parser.streamParseZip(Parser.java:217)
at water.parser.ParseDataset$MultiFileParseTask.streamParse(ParseDataset.java:907)
at water.parser.ParseDataset$MultiFileParseTask.map(ParseDataset.java:856)
at water.MRTask.compute2(MRTask.java:601)
at water.H2O$H2OCountedCompleter.compute1(H2O.java:1318)
at water.parser.ParseDataset$MultiFileParseTask$Icer.compute1(ParseDataset$MultiFileParseTask$Icer.java)
at water.H2O$H2OCountedCompleter.compute(H2O.java:1314)
... 5 more
Error: DistributedException from localhost/127.0.0.1:54321, caused by java.lang.AssertionError
Any help on how to fix this problem?
Note: Assigning more memory than 1499m also gives me an error (cannot allocate memory). I am using a machine with 16 GB of RAM.
Edit: I downloaded the 64-bit version of Java and converted my file to a CSV file. I was then able to set max_mem_size to 5g and the problem was solved.
For others who face this problem:
1. Download the latest 64-bit JDK
2. Execute the following line:
h2o.init(max_mem_size = '5g', nthreads = -1)
You are running with 32-bit java, which is limiting the memory that you are able to start H2O with. One clue is that it won't start with a higher max_mem_size. Another clue is that it says "Client VM".
You want 64-bit java instead. The 64-bit version will say "Server VM". You can download the Java 8 SE JDK from here:
http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
Based on what you've described, I recommend setting max_mem_size = '6g' or more, which will work fine on your system once you have the right version of Java installed.
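A quick way to check which flavour is installed (the exact banner wording varies by vendor and version, so treat the quoted strings as indicative rather than guaranteed):

```shell
# Print the JVM version banner; a 64-bit JDK typically reports
# "64-Bit Server VM" on its last line, a 32-bit one "Client VM".
java -version 2>&1 | tail -n 1
```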
train = h2o.importFile(path = normalizePath("C:\\Users\\All data\\traindt.rds"))
Are you trying to load an .rds file? That's an R binary format which is not readable by h2o.importFile(), so that won't work. You will need to store your training data in a cross-platform storage format (e.g. CSV, SVMLight, etc.) if you want to read it into H2O directly. If you don't have a copy in another format, just save one from R:
# read the `train` data.frame from the .rds file (readRDS() is for .rds; load() is for .RData)
train <- readRDS("C:\\Users\\All data\\traindt.rds")
# save as CSV (row.names = FALSE avoids an extra index column)
write.csv(train, "C:\\Users\\All data\\traindt.csv", row.names = FALSE)
# import the CSV into the H2O cluster directly
train <- h2o.importFile(path = normalizePath("C:\\Users\\All data\\traindt.csv"))
Another option is to read it into R from the .rds file and use the as.h2o() function:
# read the `train` data.frame from the .rds file
train <- readRDS("C:\\Users\\All data\\traindt.rds")
# send to the H2O cluster
hf <- as.h2o(train)

Pipeline process 5 (iteration) caused an error: Redundant argument in sprintf at /usr/bin/pt-query-digest line 2556

I am using percona-toolkit for analysing the MySQL slow query log. The command is pretty basic:
pt-query-digest slowquery.log
Now the result (error) is:
18.2s user time, 100ms system time, 35.61M rss, 105.19M vsz
Current date: Thu Jul 7 17:18:43 2016
Hostname: Jammer
Files: slowquery.log
Pipeline process 5 (iteration) caused an error: Redundant argument in sprintf at /usr/bin/pt-query-digest line 2556.
Will retry pipeline process 4 (iteration) 2 more times.
..
..(same result prints twice)
..
The pipeline caused an error: Pipeline process 5 (iteration) caused an error: Redundant argument in sprintf at /usr/bin/pt-query-digest line 2556.
Terminating pipeline because process 4 (iteration) caused too many errors.
Specifics of the environment: Ubuntu 16.04, MariaDB 10.1.14, Percona Toolkit 2.2.16.
I found a related bug-report, but it describes a workaround rather than an actual fix; even after applying the patch, the command's output still does not look right.
I am facing the same problem on Ubuntu 16.04 with MySQL.
The contents of my slow query log are as follows:
/usr/sbin/mysqld, Version: 5.7.16-0ubuntu0.16.04.1-log ((Ubuntu)). started with:
Tcp port: 3306 Unix socket: /var/run/mysqld/mysqld.sock
Time Id Command Argument
/usr/sbin/mysqld, Version: 5.7.16-0ubuntu0.16.04.1-log ((Ubuntu)). started with:
Tcp port: 3306 Unix socket: /var/run/mysqld/mysqld.sock
Time Id Command Argument
# Time: 2016-12-08T05:13:55.140764Z
# User@Host: root[root] @ localhost []  Id: 20
# Query_time: 0.003770  Lock_time: 0.000200  Rows_sent: 1  Rows_examined: 2
SET timestamp=1481174035;
SELECT COUNT(*) FROM INFORMATION_SCHEMA.TRIGGERS;
The error is same:
The pipeline caused an error: Pipeline process 5 (iteration) caused an
error: Redundant argument in sprintf at /usr/bin/pt-query-digest line 2556.
Ubuntu 16.04
MySql Ver 14.14 Distrib 5.7.16
pt-query-digest 2.2.16
The bug appears to be fixed in the current version of the toolkit (2.2.20), and apparently in previous ones, starting from 2.2.17.
This patch seems to do the trick for this particular place in pt-query-digest:
--- percona-toolkit-2.2.16/bin/pt-query-digest 2015-11-06 14:56:23.000000000 -0500
+++ percona-toolkit-2.2.20/bin/pt-query-digest 2016-12-06 17:01:51.000000000 -0500
@@ -2555,8 +2583,8 @@
}
return sprintf(
$num =~ m/\./ || $n
- ? "%.${p}f%s"
- : '%d',
+ ? '%1$.'.$p.'f%2$s'
+ : '%1$d',
$num, $units[$n]);
}
But as mentioned in the original question and bug report, quite a few tools/functions were affected; the full bugfix consisted of a lot of small changes:
https://github.com/percona/percona-toolkit/pull/73/files
I might be late here, but I want to share how I overcame the same error, as it might help someone searching for an answer. At the time of writing, the latest tag of Percona Toolkit is 3.0.9.
I tried running pt-query-digest after installing via apt and after downloading the deb file, both as described in the Percona documentation, but neither helped; I got the same error.
Pipeline process 5 (iteration) caused an error:
Redundant argument in sprintf at /usr/bin/pt-query-digest line (some line)
1 - I deleted/removed the existing percona-toolkit installation.
2 - First, I cleaned/updated the Perl installation:
sudo apt-get install perl
3 - Then I installed Percona Toolkit from source, as described in the repository's README (I used branch 3.0):
git clone git@github.com:percona/percona-toolkit.git
cd percona-toolkit
perl Makefile.PL
make
make test
make install
That's it. Hope this helps someone.
I found the error in percona-toolkit-3.0.12-1.el7.x86_64.rpm, while percona-toolkit-3.0.10-1.el7.x86_64.rpm is fine (percona-toolkit is very useful to me). The error was:
Redundant argument in sprintf at ./pt-query-digest line 9302.
Terminating pipeline because process 4 (iteration) caused too many errors.
Note that you will see the error message:
"Redundant argument in sprintf at"
if you forget to put a % in front of your format spec (first argument).
