How to activate hyperthreading for ipcluster and MPI - jupyter-notebook

I am starting an IPython cluster with an MPI engine to execute a jupyter notebook on multiple processes:
ipcluster start --engines=MPI -n 6 --profile=mpi
The machine has 6 cores, so this works without issue. However, I would also like to use its 12 hardware threads. How do I tell IPython/the ipcluster command to activate hyperthreading (i.e. pass --use-hwthread-cpus to the mpirun/mpiexec command it executes)?
This is the error message I receive when running the above ipcluster command with -n 12:
2023-01-06 14:01:35.586 [IPClusterStart] Starting 12 engines with <class 'ipyparallel.cluster.launcher.MPIEngineSetLauncher'>
2023-01-06 14:01:35.688 [IPClusterStart] WARNING | engine set stopped 1673010095: {'exit_code': 1, 'pid': 187667, 'identifier': 'ipengine-1673010095-187622'}
2023-01-06 14:01:35.689 [IPClusterStart] ERROR |
Engines shutdown early, they probably failed to connect.
Check the engine log files for output.
If your controller and engines are not on the same machine, you probably
have to instruct the controller to listen on an interface other than localhost.
You can set this by adding "--ip=*" to your ControllerLauncher.controller_args.
Be sure to read our security docs before instructing your controller to listen on
a public interface.
2023-01-06 14:01:35.690 [IPClusterStart] ERROR | Engine output:
Invalid MIT-MAGIC-COOKIE-1 key--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 12
slots that were requested by the application:
/****/venv/bin/python
Either request fewer slots for your application, or make more slots
available for use.
A "slot" is the Open MPI term for an allocatable unit where we can
launch a process. The number of slots available are defined by the
environment in which Open MPI processes are run:
1. Hostfile, via "slots=N" clauses (N defaults to number of
processor cores if not provided)
2. The --host command line parameter, via a ":N" suffix on the
hostname (N defaults to 1 if not provided)
3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
4. If none of a hostfile, the --host command line parameter, or an
RM is present, Open MPI defaults to the number of processor cores
In all the above cases, if you want Open MPI to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.
Alternatively, you can use the --oversubscribe option to ignore the
number of available slots when deciding the number of processes to
launch.
--------------------------------------------------------------------------
2023-01-06 14:01:35.690 [IPClusterStart] ERROR | IPython cluster: stopping
2023-01-06 14:01:35.691 [IPClusterStart] Stopping controller
2023-01-06 14:01:35.691 [IPController] CRITICAL | Received signal 15, shutting down
2023-01-06 14:01:35.692 [IPController] CRITICAL | terminating children...
2023-01-06 14:01:35.816 [IPClusterStart] Controller stopped: {'exit_code': 0, 'pid': 187624, 'identifier': 'ipcontroller-187622'}
2023-01-06 14:01:35.816 [IPClusterStart] Stopping engine(s): 1673010095
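One approach worth trying (a sketch, not verified on this machine): ipyparallel's MPI launcher appears to expose an mpi_args configurable, so the extra flag could be forwarded to mpiexec through the profile's ipcluster_config.py, roughly like this:
# ipcluster_config.py for the "mpi" profile -- assumes ipyparallel's
# MPIEngineSetLauncher still exposes the mpi_cmd/mpi_args traits.
c = get_config()  # injected by IPython's configuration loader

# Extra arguments appended to the mpiexec/mpirun command that starts the engines;
# --use-hwthread-cpus makes Open MPI count hardware threads as slots.
c.MPIEngineSetLauncher.mpi_args = ['--use-hwthread-cpus']
With that in place, ipcluster start --engines=MPI -n 12 --profile=mpi should no longer run out of slots; adding '--oversubscribe' to the same list (as the Open MPI message above suggests) would be an alternative if the launcher ignores the flag.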

Related

Vtune throws aps error while running MPI samples

I am trying to profile my MPI application using Intel VTune. While running the two commands below, I am getting errors. I tried two ways of setting I_MPI_DEBUG=5:
export I_MPI_DEBUG=5
mpirun -np 4 aps ~/binary/vasp_std_2022
mpirun -genv I_MPI_DEBUG=5 -np 4 aps ~/binary/vasp_std_2022
vtune: Warning: Memory bandwidth collection is not supported inside a virtual machine since uncore events cannot be collected. For full functionality, consider using a bare-metal environment.
vtune: Warning: CPU frequency data collection is not supported on this platform.
vtune: Error: amplxe-perf:
Using CPUID GenuineIntel-6-6A-6
both cgroup and no-aggregation modes only available in system-wide mode
Usage: perf stat [<options>] [<command>]
-G, --cgroup <name> monitor event in cgroup name only
-A, --no-aggr disable CPU count aggregation
-a, --all-cpus system-wide collection from all CPUs
--for-each-cgroup <name> expand events for each cgroup
vtune: Error: Preliminary validation of the requested events failed.
aps Error: Cannot run the collection.
aps Error: Cannot process configs directory.
aps Error: Cannot process configs directory.
aps Error: Cannot process configs directory.
You are using 'aps' in the command while profiling. When using the aps command, you need to pass parameters such as the collection mode, i.e. --collection-mode=<mode>.
This parameter specifies a comma-separated list of data to collect. Possible values:
hwc : hardware counters
omp : OpenMP statistics
mpi : MPI statistics
all : all possible data (default)
Try a command like the one below:
mpirun -genv I_MPI_DEBUG=5 -np 4 aps --collection-mode=omp ./obj

Openmpi 4.0.5 fails to distribute tasks to more than 1 node

We are having trouble with Open MPI 4.0.5 on our cluster: it works as long as only one node is requested, but as soon as more than one is requested (e.g. mpirun -np 24 ./hello_world with --ntasks-per-node=12), it crashes and we get the following error message:
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 2
slots that were requested by the application:
./hello_world
Either request fewer slots for your application, or make more slots
available for use.
A "slot" is the Open MPI term for an allocatable unit where we can
launch a process. The number of slots available are defined by the
environment in which Open MPI processes are run:
1. Hostfile, via "slots=N" clauses (N defaults to number of
processor cores if not provided)
2. The --host command line parameter, via a ":N" suffix on the
hostname (N defaults to 1 if not provided)
3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
4. If none of a hostfile, the --host command line parameter, or an
RM is present, Open MPI defaults to the number of processor cores
In all the above cases, if you want Open MPI to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.
Alternatively, you can use the --oversubscribe option to ignore the
number of available slots when deciding the number of processes to
launch.
--------------------------------------------------------------------------
I have tried using --oversubscribe, but this will still only use 1 node, even though smaller jobs would run that way. I have also tried specifically requesting nodes (e.g. -host node36,node37), but this results in the following error message:
[node37:16739] *** Process received signal ***
[node37:16739] Signal: Segmentation fault (11)
[node37:16739] Signal code: Address not mapped (1)
[node37:16739] Failing at address: (nil)
[node37:16739] [ 0] /lib64/libpthread.so.0(+0xf5f0)[0x2ac57d70e5f0]
[node37:16739] [ 1] /lib64/libc.so.6(+0x13ed5a)[0x2ac57da59d5a]
[node37:16739] [ 2] /usr/lib64/openmpi/lib/libopen-rte.so.12(orte_daemon+0x10d7)[0x2ac57c6c4827]
[node37:16739] [ 3] orted[0x4007a7]
[node37:16739] [ 4] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2ac57d93d505]
[node37:16739] [ 5] orted[0x400810]
[node37:16739] *** End of error message ***
The cluster has 59 nodes. Slurm 19.05.0 is used as the scheduler and gcc 9.1.0 to compile.
I don't have much experience with MPI, so any help would be much appreciated! Maybe someone is familiar with this error and could point me towards what the problem might be.
Thanks for your help,
Johanna

Data unpack would read past end of buffer in file util/show_help.c at line 501

I submitted a job via Slurm. The job ran for 12 hours and was working as expected. Then I got "Data unpack would read past end of buffer in file util/show_help.c at line 501". It is usual for me to get errors like "ORTE has lost communication with a remote daemon", but I usually get those at the beginning of a job; that is annoying, but it does not cause as much time loss as getting an error after 12 hours. Is there a quick fix for this? The Open MPI version is 4.0.1.
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.
Local host: barbun40
Local adapter: mlx5_0
Local port: 1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: barbun40
Local device: mlx5_0
--------------------------------------------------------------------------
[barbun21.yonetim:48390] [[15284,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in
file util/show_help.c at line 501
[barbun21.yonetim:48390] 127 more processes have sent help message help-mpi-btl-openib.txt / ib port
not selected
[barbun21.yonetim:48390] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error
messages
[barbun21.yonetim:48390] 126 more processes have sent help message help-mpi-btl-openib.txt / error in
device init
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected. This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).
Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate. For
example, there may be a core file that you can examine. More
generally: such peer hangups are frequently caused by application bugs
or other external events.
Local host: barbun64
Local PID: 252415
Peer host: barbun39
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[15284,1],35]
Exit code: 9
--------------------------------------------------------------------------

Error in MPI program execution - no active ports found

I am trying to run a simple MPI job across multiple hosts of a cluster.
[capc#gpu6 mpi_tests]$ /opt/openmpi4.0.3/build/bin/mpirun --host gpu7,gpu6 ./a.out
WARNING: There is at least non-excluded one OpenFabrics device found,
but there are no active ports detected (or Open MPI was unable to use
them). This is most certainly not what you wanted. Check your
cables, subnet manager configuration, etc. The openib BTL will be
ignored for this job.
Local host: gpu7
We have 2 processes.
WARNING: Open MPI accepted a TCP connection from what appears to be a
another Open MPI process but cannot find a corresponding process
entry for that peer.
This attempted connection will be ignored; your MPI job may or may not
continue properly.
Local host: gpu6
PID: 29209
[gpu6:29203] 1 more process has sent help message help-mpi-btl-openib.txt / no active ports found
[gpu6:29203] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
I have compiled the MPI program with mpicc, and when I run it with mpirun it hangs.
Can anyone guide me regarding this?

mpi_comm_spawn on remote nodes

How does one use MPI_Comm_spawn to start worker processes on remote nodes?
Using OpenMPI 1.4.3, I've tried this code:
MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "host", "node2");   /* ask for placement on node2 */

MPI_Comm intercom;
MPI_Comm_spawn("worker",               /* executable to launch       */
               MPI_ARGV_NULL,          /* no extra argv              */
               nprocs,                 /* number of workers          */
               info,
               0,                      /* root rank in MPI_COMM_SELF */
               MPI_COMM_SELF,
               &intercom,
               MPI_ERRCODES_IGNORE);
But that fails with this error message:
--------------------------------------------------------------------------
There are no allocated resources for the application
worker
that match the requested mapping:
Verify that you have mapped the allocated resources properly using the
--host or --hostfile specification.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
launch so we are aborting.
There may be more information reported by the environment (see above).
This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
If I replace the "node2" with the name of my local machine, then it works fine. If I ssh into node2 and run the same thing there (with "node2" in the info dictionary) then it also works fine.
I don't want to start the parent process with mpirun, so I'm just looking for a way to dynamically spawn processes on remote nodes. Is this possible?
"I don't want to start the parent process with mpirun, so I'm just looking for a way to dynamically spawn processes on remote nodes. Is this possible?"
I'm not sure why you don't want to start it with mpirun. You're implicitly starting up the whole MPI machinery anyway as soon as you hit MPI_Init(); this way you just get to pass it options rather than relying on the defaults.
The issue here is simply that when the MPI library starts up (at MPI_Init()) it doesn't see any other hosts available, because you haven't given it any with the --host or --hostfile options to mpirun. It won't just launch processes elsewhere on your say-so (indeed, spawn doesn't require an Info "host" key, so in general it wouldn't even know where to go otherwise), so it fails.
So you'll need to do
mpirun --host myhost,host2 -np 1 ./parentjob
or, more generally, provide a hostfile, preferably with a number of slots available
myhost slots=1
host2 slots=8
host3 slots=8
and launch the jobs this way: mpirun --hostfile mpihosts.txt -np 1 ./parentjob. This is a feature, not a bug; now it's MPI's job to figure out where the workers go, and if you don't specify a host explicitly in the info, it will try to put them in the most underutilized place. It also means you don't have to recompile to change the hosts you'll spawn to.
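For illustration only, the same pattern can be sketched in Python with mpi4py (assumptions: mpi4py is installed, a "worker" executable exists on node2, and the parent is started via mpirun --hostfile mpihosts.txt -np 1 as described above):
from mpi4py import MPI

nprocs = 4                      # number of workers to spawn (illustrative)
info = MPI.Info.Create()
info.Set("host", "node2")       # request placement on node2

# Spawn returns an intercommunicator connecting this parent to the workers.
intercomm = MPI.COMM_SELF.Spawn("worker", maxprocs=nprocs, info=info, root=0)
As in the C case, the placement only works if node2 was part of the host list given to mpirun.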
