Data unpack would read past end of buffer in file util/show_help.c at line 501 - mpi

I submitted a job via Slurm. It ran for 12 hours as expected and then failed with "Data unpack would read past end of buffer in file util/show_help.c at line 501". I often get errors such as "ORTE has lost communication with a remote daemon", but usually at the beginning of the job. That is annoying, but it costs far less time than an error after 12 hours. Is there a quick fix for this? The Open MPI version is 4.0.1.
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.
Local host: barbun40
Local adapter: mlx5_0
Local port: 1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: barbun40
Local device: mlx5_0
--------------------------------------------------------------------------
[barbun21.yonetim:48390] [[15284,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file util/show_help.c at line 501
[barbun21.yonetim:48390] 127 more processes have sent help message help-mpi-btl-openib.txt / ib port not selected
[barbun21.yonetim:48390] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[barbun21.yonetim:48390] 126 more processes have sent help message help-mpi-btl-openib.txt / error in device init
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected. This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).
Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate. For
example, there may be a core file that you can examine. More
generally: such peer hangups are frequently caused by application bugs
or other external events.
Local host: barbun64
Local PID: 252415
Peer host: barbun39
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[15284,1],35]
Exit code: 9
--------------------------------------------------------------------------
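The log itself suggests setting the MCA parameter "orte_base_help_aggregate" to 0 to see every help/error message instead of the aggregated summary. A minimal sketch of how such parameters could be passed on the mpirun command line, assuming mpirun's --mca option; the helper name mca_args and the program path ./my_app are hypothetical placeholders. This does not fix the failure by itself; it only makes the full messages visible.

```python
from typing import Dict, List


def mca_args(params: Dict[str, str]) -> List[str]:
    """Turn {"name": "value"} into ["--mca", "name", "value", ...] for mpirun."""
    args: List[str] = []
    for name, value in params.items():
        args += ["--mca", name, str(value)]
    return args


# Disable help-message aggregation, as the log output suggests.
# "./my_app" is a placeholder for the real MPI program.
cmd = ["mpirun"] + mca_args({"orte_base_help_aggregate": "0"}) + ["./my_app"]
print(" ".join(cmd))
```

Other parameters named in the log, such as btl_openib_allow_ib, could be passed through the same dictionary.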

Related

How to activate hyperthreading for ipcluster and MPI

I am starting an IPython cluster with an MPI engine to execute a jupyter notebook on multiple processes:
ipcluster start --engines=MPI -n 6 --profile=mpi
The machine has 6 cores so this works without an issue. However, I would also like to use its 12 threads. How do I tell IPython/the ipcluster command to activate hyperthreading (i.e. pass --use-hwthread-cpus to the mpirun/mpiexec command it executes)?
Error message I receive when running the above ipcluster command with -n 12:
2023-01-06 14:01:35.586 [IPClusterStart] Starting 12 engines with <class 'ipyparallel.cluster.launcher.MPIEngineSetLauncher'>
2023-01-06 14:01:35.688 [IPClusterStart] WARNING | engine set stopped 1673010095: {'exit_code': 1, 'pid': 187667, 'identifier': 'ipengine-1673010095-187622'}
2023-01-06 14:01:35.689 [IPClusterStart] ERROR |
Engines shutdown early, they probably failed to connect.
Check the engine log files for output.
If your controller and engines are not on the same machine, you probably
have to instruct the controller to listen on an interface other than localhost.
You can set this by adding "--ip=*" to your ControllerLauncher.controller_args.
Be sure to read our security docs before instructing your controller to listen on
a public interface.
2023-01-06 14:01:35.690 [IPClusterStart] ERROR | Engine output:
Invalid MIT-MAGIC-COOKIE-1 key
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 12
slots that were requested by the application:
/****/venv/bin/python
Either request fewer slots for your application, or make more slots
available for use.
A "slot" is the Open MPI term for an allocatable unit where we can
launch a process. The number of slots available are defined by the
environment in which Open MPI processes are run:
1. Hostfile, via "slots=N" clauses (N defaults to number of
processor cores if not provided)
2. The --host command line parameter, via a ":N" suffix on the
hostname (N defaults to 1 if not provided)
3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
4. If none of a hostfile, the --host command line parameter, or an
RM is present, Open MPI defaults to the number of processor cores
In all the above cases, if you want Open MPI to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.
Alternatively, you can use the --oversubscribe option to ignore the
number of available slots when deciding the number of processes to
launch.
--------------------------------------------------------------------------
2023-01-06 14:01:35.690 [IPClusterStart] ERROR | IPython cluster: stopping
2023-01-06 14:01:35.691 [IPClusterStart] Stopping controller
2023-01-06 14:01:35.691 [IPController] CRITICAL | Received signal 15, shutting down
2023-01-06 14:01:35.692 [IPController] CRITICAL | terminating children...
2023-01-06 14:01:35.816 [IPClusterStart] Controller stopped: {'exit_code': 0, 'pid': 187624, 'identifier': 'ipcontroller-187622'}
2023-01-06 14:01:35.816 [IPClusterStart] Stopping engine(s): 1673010095
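The Open MPI message above names the two relevant options: --use-hwthread-cpus counts hardware threads as slots, and --oversubscribe ignores the slot count. A minimal sketch of the decision, assuming a hypothetical helper mpiexec_command that is not part of ipyparallel or Open MPI:

```python
import os
from typing import List, Optional


def mpiexec_command(n_procs: int, physical_cores: int,
                    logical_cpus: Optional[int] = None) -> List[str]:
    """Build an mpiexec argument list, adding --use-hwthread-cpus when needed."""
    if logical_cpus is None:
        # os.cpu_count() reports logical CPUs (hardware threads), not cores.
        logical_cpus = os.cpu_count() or physical_cores
    if n_procs > logical_cpus:
        raise ValueError(
            f"{n_procs} processes exceed {logical_cpus} hardware threads; "
            "--oversubscribe would be needed instead"
        )
    cmd = ["mpiexec", "-n", str(n_procs)]
    if n_procs > physical_cores:
        # More processes than cores: count hardware threads as slots.
        cmd.append("--use-hwthread-cpus")
    return cmd


print(mpiexec_command(12, physical_cores=6, logical_cpus=12))
```

To have ipcluster pass the flag itself, the MPI launchers in ipyparallel appear to expose an mpi_args configuration option (for example on MPIEngineSetLauncher in ipcluster_config.py); treat that as an assumption and verify it against the installed version.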

Mariadb Galera 10.5.13-16 Node Crash

I have a cluster with 2 Galera nodes and 1 arbitrator.
Node 1 crashed and I don't understand why.
Here is the log from node 1.
It seems to be a problem with the pthread library.
All requests are proxied through 2 HAProxy instances.
2023-01-03 12:08:55 0 [Warning] WSREP: Handshake failed: peer did not return a certificate
2023-01-03 12:08:55 0 [Warning] WSREP: Handshake failed: peer did not return a certificate
2023-01-03 12:08:56 0 [Warning] WSREP: Handshake failed: http request
terminate called after throwing an instance of 'boost::wrapexcept<std::system_error>'
what(): remote_endpoint: Transport endpoint is not connected
230103 12:08:56 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Server version: 10.5.13-MariaDB-1:10.5.13+maria~focal
key_buffer_size=134217728
read_buffer_size=2097152
max_used_connections=101
max_threads=102
thread_count=106
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 760333 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x49000
mariadbd(my_print_stacktrace+0x32)[0x55b1b67f7e42]
Printing to addr2line failed
mariadbd(handle_fatal_signal+0x485)[0x55b1b62479a5]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7ff88ea983c0]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7ff88e59e18b]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7ff88e57d859]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911)[0x7ff88e939911]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c)[0x7ff88e94538c]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3f7)[0x7ff88e9453f7]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa6a9)[0x7ff88e9456a9]
/usr/lib/galera/libgalera_smm.so(+0x448ad)[0x7ff884b5e8ad]
/usr/lib/galera/libgalera_smm.so(+0x1fc315)[0x7ff884d16315]
/usr/lib/galera/libgalera_smm.so(+0x1ff7eb)[0x7ff884d197eb]
/usr/lib/galera/libgalera_smm.so(+0x1ffc28)[0x7ff884d19c28]
/usr/lib/galera/libgalera_smm.so(+0x2065b6)[0x7ff884d205b6]
/usr/lib/galera/libgalera_smm.so(+0x1f81f3)[0x7ff884d121f3]
/usr/lib/galera/libgalera_smm.so(+0x1e6f04)[0x7ff884d00f04]
/usr/lib/galera/libgalera_smm.so(+0x103438)[0x7ff884c1d438]
/usr/lib/galera/libgalera_smm.so(+0xe8eea)[0x7ff884c02eea]
/usr/lib/galera/libgalera_smm.so(+0xe9a8d)[0x7ff884c03a8d]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x9609)[0x7ff88ea8c609]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7ff88e67a293]
The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /var/lib/mysql
Resource Limits:
Fatal signal 11 while backtracing
PS: if you want more data ask me :)
OK, it seems that 2 simultaneous OpenVAS scans crash the node.
I tried versions 10.5.13 and 10.5.16, and both crash.
Solution: upgrade to at least 10.5.17.

Airflow Exception - Task received SIGTERM signal

I am running Airflow tasks using the SSH operator. I am fairly sure the Python program has no errors, because it runs successfully when I run it directly. But when it is run from Airflow, I get a SIGTERM error towards the end of program execution.
I looked into various suggested solutions, but nothing worked. I tried increasing
killed_task_cleanup_time from 60 to 1200 in the airflow.cfg file. I also tried changing hostname_callable to socket:gethostname in airflow.cfg, because I received the following warning before this error:
Warning: The recorded hostname xxx does not match this instance's hostname
Error:
[2020-10-15 10:45:34,937] {taskinstance.py:954} ERROR - Received SIGTERM. Terminating subprocesses.
[2020-10-15 10:45:34,959] {taskinstance.py:1145} ERROR - SSH operator error: Task received SIGTERM signal
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/site-packages/airflow/contrib/operators/ssh_operator.py", line 137, in execute
    readq, _, _ = select([channel], [], [], self.timeout)
  File "/opt/anaconda3/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 956, in signal_handler
    raise AirflowException("Task received SIGTERM signal")
airflow.exceptions.AirflowException: Task received SIGTERM signal
Any ideas and suggestions would be really helpful. I have been stuck on this for a day now.
This problem occurs because the recorded hostname XXX resolves to an IP address that differs from the one the instance's hostname resolves to, which results in the SIGTERM error. You need to specify the IP mapping for the recorded hostname XXX.
Possibly this thread might help? https://issues.apache.org/jira/browse/AIRFLOW-966.
Which version of airflow are you using, and did you check your celery broker settings?
The solution seems to be setting the visibility timeout higher than the Celery default of 1 hour, which prevents Celery from re-submitting the job. I believe this only affects tasks created via a manual run or the CLI, not normally scheduled tasks.
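The reasoning behind the visibility-timeout suggestion can be sketched as a simple check: a task that runs longer than the timeout risks being handed out again by the broker. This is a rough illustration, assuming a broker that uses a visibility timeout (for example Redis) and the 1-hour default mentioned above; the function names and the safety factor are hypothetical, not Airflow or Celery APIs.

```python
CELERY_DEFAULT_VISIBILITY_TIMEOUT_S = 3600  # the 1-hour default mentioned above


def may_be_resubmitted(expected_runtime_s: float,
                       visibility_timeout_s: float = CELERY_DEFAULT_VISIBILITY_TIMEOUT_S) -> bool:
    """True if the task is expected to outlive the broker's visibility timeout."""
    return expected_runtime_s >= visibility_timeout_s


def suggested_visibility_timeout(expected_runtime_s: float,
                                 safety_factor: float = 2.0) -> int:
    """Pick a timeout comfortably above the longest expected task runtime."""
    return int(expected_runtime_s * safety_factor)


two_hours = 2 * 3600
print(may_be_resubmitted(two_hours))            # a 2-hour task exceeds the default
print(suggested_visibility_timeout(two_hours))  # candidate value in seconds
```

In Airflow the value would typically go into airflow.cfg as visibility_timeout under the [celery_broker_transport_options] section; check the documentation for the Airflow version in use before relying on that.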

Error in MPI program execution - no active ports found

I am trying to run a simple MPI job across multiple hosts of a cluster.
[capc@gpu6 mpi_tests]$ /opt/openmpi4.0.3/build/bin/mpirun --host gpu7,gpu6 ./a.out
WARNING: There is at least non-excluded one OpenFabrics device found,
but there are no active ports detected (or Open MPI was unable to use
them). This is most certainly not what you wanted. Check your
cables, subnet manager configuration, etc. The openib BTL will be
ignored for this job.
Local host: gpu7
We have 2 processes.
WARNING: Open MPI accepted a TCP connection from what appears to be a
another Open MPI process but cannot find a corresponding process
entry for that peer.
This attempted connection will be ignored; your MPI job may or may not
continue properly.
Local host: gpu6
PID: 29209
[gpu6:29203] 1 more process has sent help message help-mpi-btl-openib.txt / no active ports found
[gpu6:29203] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
I compiled the MPI program with mpicc, and when I run it with mpirun it prints the output above and then hangs.
Can anyone guide me regarding this?

bind failure: Address already in use even though recycle and reuse flags are set to 1

Environment:
Unix client and Unix server.
Tool used: curl.
The client and server should ignore the TIME_WAIT period (2 * MSL) when establishing a connection.
This is done by executing the following commands:
sysctl net.ipv4.tcp_tw_reuse=1
sysctl net.ipv4.tcp_tw_recycle=1
The local port must be specified so that it can be reused.
Start the connection.
Example: while [ 1 ]; do curl --local-port 9056 192.168.40.2; sleep 30; done
I still see the error, even though the TIME_WAIT period should have been ignored.
Any idea why this is happening?
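One possible explanation, assuming Linux: as far as I understand, net.ipv4.tcp_tw_reuse influences outgoing connect() calls, whereas bind() to an explicit local port is governed by the SO_REUSEADDR option on the socket itself. Whether curl sets that option for --local-port is not confirmed here. A minimal sketch of a client socket that sets it before binding (the function name is hypothetical):

```python
import socket


def client_socket_with_fixed_port(local_port: int,
                                  local_addr: str = "") -> socket.socket:
    """Create a TCP socket bound to a fixed local port with SO_REUSEADDR set."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Must be set before bind(); allows binding while an earlier connection
    # from the same local port is still in TIME_WAIT.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((local_addr, local_port))
    return sock


if __name__ == "__main__":
    # Port 0 lets the OS choose; the example above would use 9056.
    sock = client_socket_with_fixed_port(0, "127.0.0.1")
    print(sock.getsockname())
    sock.close()
```

Note also that net.ipv4.tcp_tw_recycle was removed in Linux 4.12, so on newer kernels that sysctl does not exist.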
