slurm can't execute srun - mpi

Slurm cannot launch my MPI program on more than one task. Running
srun -n 1 /home/user/share/test/hello.o
works:
rpi40000: 0 of 1
but
srun -n 2 /home/user/share/test/hello.o
fails:
srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
srun: error: Task launch for StepId=53.0 failed on node rpi40001: Socket timed out on send/recv operation
srun: error: Application launch failed: Socket timed out on send/recv operation
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd-rpi40000: error: *** STEP 53.0 ON rpi40000 CANCELLED AT 2022-06-27T01:33:37 ***
srun: error: rpi40000: task 0: Killed

Related

mpirun running job serially with only one core

I installed MPICH 4.1 on an Ubuntu machine using the GNU compilers. At first I ran a job successfully with mpirun on 36 cores, but now the same job runs serially, using only one core. The output of mpirun -np 36 ./wrf.exe is now:
starting wrf task 0 of 1
starting wrf task 0 of 1
starting wrf task 0 of 1
starting wrf task 0 of 1
starting wrf task 0 of 1
starting wrf task 0 of 1
Running mpivars fails with:
Abort(470406415): Fatal error in internal_Init_thread: Other MPI error, error stack:
internal_Init_thread(67): MPI_Init_thread(argc=0x7fff8044f34c, argv=0x7fff8044f340, required=0, provided=0x7fff8044f350) failed
MPII_Init_thread(222)...: gpu_init failed
But the machine has no GPU.
The MPI version command gives:
HYDRA build details:
Version: 4.1
Release Date: Fri Jan 27 13:54:44 CST 2023
CC: gcc
Configure options: '--disable-option-checking' '--prefix=/home/MODULES' '--cache-file=/dev/null' '--srcdir=.' 'CC=gcc' 'CFLAGS= -O2' 'LDFLAGS=' 'LIBS=' 'CPPFLAGS= -DNETMOD_INLINE=__netmod_inline_ofi__ -I/home/MODULES/mpich-4.1/src/mpl/include -I/home/MODULES/mpich-4.1/modules/json-c -D_REENTRANT -I/home/MODULES/mpich-4.1/src/mpi/romio/include -I/home/MODULES/mpich-4.1/src/pmi/include -I/home/MODULES/mpich-4.1/modules/yaksa/src/frontend/include -I/home/MODULES/mpich-4.1/modules/libfabric/include'
Process Manager: pmi
Launchers available: ssh rsh fork slurm ll lsf sge manual persist
Topology libraries available: hwloc
Resource management kernels available: user slurm ll lsf sge pbs cobalt
Demux engines available: poll select
What could be causing this?
Thanks in advance.

#dask Local abort before MPI_INIT completed completed successfully

I ran the following command:
mpirun --hostfile /home/user/share/hostlist.txt -np 4 /home/user/share/mpi-dask/venv/bin/dask-mpi --scheduler-file ~/dask-scheduler.json
I got the following output:
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[rpi40000:14497] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
2022-06-23 06:40:12,321 - distributed.nanny - INFO - Worker process 14497 exited with status 1
2022-06-23 06:40:12,324 - distributed.nanny - WARNING - Restarting worker
^C[rpi40000:14416] PMIX ERROR: UNREACHABLE in file ../../../src/server/pmix_server.c at line 2795
[rpi40000:14416] 8 more processes have sent help message help-orte-runtime.txt / orte_init:startup:internal-failure
[rpi40000:14416] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[rpi40000:14416] 5 more processes have sent help message help-orte-runtime / orte_init:startup:internal-failure
[rpi40000:14416] 5 more processes have sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure

MPI error from very simple example fortran code

I have compiled this code:
program mpisimple
implicit none
integer ierr
include 'mpif.h'
call mpi_init(ierr)
write(6,*) 'Hello World!'
call mpi_finalize(ierr)
end
using the command: mpif90 -o helloworld simplempi.f90
When I run with this command:
$ mpiexec -np 1 ./helloworld
Hello World!
It works fine, as you can see. But when I run with any other number of processes (here 4), I get the errors below and have to press Ctrl+C to kill it:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(805).....: fail failed
MPID_Init(1859)...........: channel initialization failed
MPIDI_CH3_Init(126).......: fail failed
MPID_nem_init_ckpt(858)...: fail failed
MPIDI_CH3I_Seg_commit(427): PMI_KVS_Get returned 4
In: PMI_Abort(69777679, Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(805).....: fail failed
MPID_Init(1859)...........: channel initialization failed
MPIDI_CH3_Init(126).......: fail failed
MPID_nem_init_ckpt(858)...: fail failed
MPIDI_CH3I_Seg_commit(427): PMI_KVS_Get returned 4)
forrtl: severe (174): SIGSEGV, segmentation fault occurred
What could be the problem? I am doing this on a Linux hpc system.
I figured out why this happened. The system I am using does not require users to submit single-core jobs through the scheduler, but does require it for multi-core jobs. Once the mpiexec command was submitted through a PBS bash script, the errors went away and output was as expected.
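For illustration, a batch script of the kind described above can be generated with a few lines of Python. This is only a sketch: the function name make_pbs_script, the job name, the walltime, and the Torque-style nodes=1:ppn=N resource request are assumptions, not details of the system in question. PBS Pro sites typically expect a select= request instead, so check the local documentation.

```python
def make_pbs_script(executable, nprocs, walltime="00:10:00", job_name="mpi_hello"):
    """Return the text of a PBS batch script that runs `executable` under mpiexec.

    Hypothetical helper: the defaults and the resource syntax are assumptions.
    """
    if nprocs < 1:
        raise ValueError("nprocs must be at least 1")
    lines = [
        "#!/bin/bash",
        f"#PBS -N {job_name}",
        # Torque-style request (assumed); PBS Pro usually wants "-l select=..." instead.
        f"#PBS -l nodes=1:ppn={nprocs}",
        f"#PBS -l walltime={walltime}",
        # PBS starts the job in the home directory; go back to the submit directory.
        'cd "$PBS_O_WORKDIR"',
        f"mpiexec -np {nprocs} {executable}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print a script for the four-process case from the question.
    print(make_pbs_script("./helloworld", 4), end="")
```

Saving the printed text to a file (say job.pbs, a made-up name) and submitting it with qsub job.pbs would run the four-process case through the scheduler rather than directly on the login node.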

Supervisord - NGINX stop OSError

I ran into an error when trying to stop NGINX using supervisord.
To start NGINX without error from supervisord I had to prepend sudo to the nginx command in supervisord.conf:
[supervisord]
[program:nginx]
command=sudo nginx -c %(ENV_PWD)s/configs/nginx.conf
When I run this:
$ supervisord -n
2017-02-09 12:26:06,371 INFO RPC interface 'supervisor' initialized
2017-02-09 12:26:06,372 INFO RPC interface 'supervisor' initialized
2017-02-09 12:26:06,372 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2017-02-09 12:26:06,373 INFO supervisord started with pid 22152
2017-02-09 12:26:07,379 INFO spawned: 'nginx' with pid 22155
2017-02-09 12:26:08,384 INFO success: nginx entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
^C # SIGINT: Should stop all processes
2017-02-09 13:59:08,550 WARN received SIGINT indicating exit request
2017-02-09 13:59:08,551 CRIT unknown problem killing nginx (22155):Traceback (most recent call last):
File "/Users/ocervell/.virtualenvs/ndc-v3.3/lib/python2.7/site-packages/supervisor/process.py", line 432, in kill
options.kill(pid, sig)
File "/Users/ocervell/.virtualenvs/ndc-v3.3/lib/python2.7/site-packages/supervisor/options.py", line 1239, in kill
os.kill(pid, signal)
OSError: [Errno 1] Operation not permitted
Same when using supervisorctl to stop the process:
$ supervisorctl stop nginx
FAILED: unknown problem killing nginx (22321):Traceback (most recent call last):
File "/Users/ocervell/.virtualenvs/ndc-v3.3/lib/python2.7/site-packages/supervisor/process.py", line 432, in kill
options.kill(pid, sig)
File "/Users/ocervell/.virtualenvs/ndc-v3.3/lib/python2.7/site-packages/supervisor/options.py", line 1239, in kill
os.kill(pid, signal)
OSError: [Errno 1] Operation not permitted
Is there a workaround for this?
If a process created by supervisord creates its own child processes, supervisord cannot kill them.
...
The pidproxy program is put into your configuration’s $BINDIR when supervisor is installed (it is a “console script”).[1]
So you need to change your supervisord configuration like this:
[program:nginx]
command=/path/to/pidproxy /path/to/nginx-pidfile sudo nginx -c %(ENV_PWD)s/configs/nginx.conf
This may not work either, since the nginx process is created by sudo, but it is worth trying first.
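The "Operation not permitted" in both tracebacks is errno 1 (EPERM) raised by os.kill. A likely cause, assuming supervisord itself runs as an unprivileged user, is that the process it spawned is sudo, which runs as root, and the kernel does not let one user signal another user's process. The sketch below shows how that call behaves in Python 3; send_signal is a hypothetical helper written for this example, not supervisor's own code.

```python
import os
import signal
import subprocess
import sys


def send_signal(pid, sig=signal.SIGTERM):
    """Send `sig` to `pid` and report the outcome as a short string."""
    try:
        os.kill(pid, sig)
    except PermissionError:
        # errno 1 (EPERM): the target belongs to another user, e.g. root via sudo.
        return "not permitted"
    except ProcessLookupError:
        # errno 3 (ESRCH): no process with that PID exists.
        return "no such process"
    return "sent"


if __name__ == "__main__":
    # A child started by this user can be signalled without any privileges.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print(send_signal(child.pid))
    child.wait()
```

Signalling a child owned by the same user succeeds, while the same call against a root-owned process from a non-root account would take the PermissionError branch, which matches the traceback in the question.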

SparkR Error in if (returnStatus !=0)

I have set up a Spark cluster on EC2 with 20 nodes, listed all the node IPs in conf/slaves on the master, and launched a job with SparkR and 50 slices. My nodes are dual-core with 4 GB of memory, and at the end of the job I collect the results into a CSV file that should contain about 15000 lines (and 7 columns of floats). The job runs fine for a while (6000 s) until I get the following error from the master (this is not from the Spark master log, but from the terminal window where I execute the Spark job):
16/03/21 22:39:31 INFO TaskSetManager: Finished task 27.0 in stage 0.0 (TID 27) in 5954810 ms on ip-xxx-yy-xx-zzz.somewhere.compute.internal (8/40)
16/03/21 22:39:38 INFO TaskSetManager: Finished task 12.0 in stage 0.0 (TID 12) in 5962190 ms on ip-xxx-xx-xx-xxx.somewhere.compute.internal (9/40)
Error in if (returnStatus != 0) { : argument is of length zero
Calls: <Anonymous> -> <Anonymous> -> .local -> callJMethod -> invokeJava
Execution halted
16/03/21 22:40:16 INFO SparkContext: Invoking stop() from shutdown hook
16/03/21 22:40:16 INFO SparkUI: Stopped Spark web UI at http://172.31.21.134:4040
16/03/21 22:40:16 INFO DAGScheduler: Job 0 failed: collect at NativeMethodAccessorImpl.java:-2, took 6001.135894 s
16/03/21 22:40:16 INFO DAGScheduler: ShuffleMapStage 0 (RDD at RRDD.scala:36) failed in 6000.500 s
16/03/21 22:40:16 ERROR RBackendHandler: collect on 16 failed
16/03/21 22:40:16 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerStageCompleted(org.apache.spark.scheduler.StageInfo#6c9d21b2)
16/03/21 22:40:16 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerJobEnd(0,1458600016592,JobFailed(org.apache.spark.SparkException: Job 0 cancelled because SparkContext was shut down))
16/03/21 22:40:16 INFO SparkDeploySchedulerBackend: Shutting down all executors
I checked in the worker logs and I see the following two lines at the end of the log file:
16/03/21 22:40:16 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
16/03/21 22:40:16 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
and then the log stops abruptly (no other errors or warnings before).
I don't see any hint in the log file as to what could have caused the crash. My only guess is an out-of-memory error, because the job runs fine on a reduced input dataset. Am I missing something?

Resources