mpirun running job serially with only one core - mpi

I have installed MPICH 4.1 on an Ubuntu machine using the GNU compiler. In the beginning I ran one job successfully using mpirun on 36 cores, but now when I try to run the same job it runs serially, using only one core. The output of mpirun -np 36 ./wrf.exe is now:
starting wrf task 0 of 1
starting wrf task 0 of 1
starting wrf task 0 of 1
starting wrf task 0 of 1
starting wrf task 0 of 1
starting wrf task 0 of 1
Running mpivars gives this error:
Abort(470406415): Fatal error in internal_Init_thread: Other MPI error, error stack:
internal_Init_thread(67): MPI_Init_thread(argc=0x7fff8044f34c, argv=0x7fff8044f340, required=0, provided=0x7fff8044f350) failed
MPII_Init_thread(222)...: gpu_init failed
But the machine does not have a GPU.
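A hedged workaround, assuming this build honors MPICH 4.x's runtime CVARs: GPU support can be disabled at run time even though it was compiled in, which should sidestep the gpu_init failure on a GPU-less machine:
# disable MPICH's GPU awareness for this shell (MPICH 4.x CVAR; an assumption, verify for your build)
export MPIR_CVAR_ENABLE_GPU=0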
The MPI version command (mpirun --version) gives:
HYDRA build details:
Version: 4.1
Release Date: Fri Jan 27 13:54:44 CST 2023
CC: gcc
Configure options: '--disable-option-checking' '--prefix=/home/MODULES' '--cache-file=/dev/null' '--srcdir=.' 'CC=gcc' 'CFLAGS= -O2' 'LDFLAGS=' 'LIBS=' 'CPPFLAGS= -DNETMOD_INLINE=__netmod_inline_ofi__ -I/home/MODULES/mpich-4.1/src/mpl/include -I/home/MODULES/mpich-4.1/modules/json-c -D_REENTRANT -I/home/MODULES/mpich-4.1/src/mpi/romio/include -I/home/MODULES/mpich-4.1/src/pmi/include -I/home/MODULES/mpich-4.1/modules/yaksa/src/frontend/include -I/home/MODULES/mpich-4.1/modules/libfabric/include'
Process Manager: pmi
Launchers available: ssh rsh fork slurm ll lsf sge manual persist
Topology libraries available: hwloc
Resource management kernels available: user slurm ll lsf sge pbs cobalt
Demux engines available: poll select
What could be the possible reason for this?
Thanks in advance.
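One common cause matching this symptom (an assumption, not something confirmed by the output above): when every rank prints "task 0 of 1", the binary is usually linked against a different MPI library than the mpirun launching it, so each process initializes as its own single-rank job. A quick check:
# confirm the launcher and the library come from the same MPICH install
which mpirun                  # launcher found on PATH
ldd ./wrf.exe | grep -i mpi   # libmpi the binary actually links against
# both should resolve into the same prefix (here, /home/MODULES)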

Related

How to activate hyperthreading for ipcluster and MPI

I am starting an IPython cluster with MPI engines to execute a Jupyter notebook on multiple processes:
ipcluster start --engines=MPI -n 6 --profile=mpi
The machine has 6 cores, so this works without an issue. However, I would also like to use its 12 threads. How do I tell IPython/the ipcluster command to activate hyperthreading (i.e. pass --use-hwthread-cpus to the mpirun/mpiexec command it executes)?
This is the error message I receive when running the above ipcluster command with 12 engines:
2023-01-06 14:01:35.586 [IPClusterStart] Starting 12 engines with <class 'ipyparallel.cluster.launcher.MPIEngineSetLauncher'>
2023-01-06 14:01:35.688 [IPClusterStart] WARNING | engine set stopped 1673010095: {'exit_code': 1, 'pid': 187667, 'identifier': 'ipengine-1673010095-187622'}
2023-01-06 14:01:35.689 [IPClusterStart] ERROR |
Engines shutdown early, they probably failed to connect.
Check the engine log files for output.
If your controller and engines are not on the same machine, you probably
have to instruct the controller to listen on an interface other than localhost.
You can set this by adding "--ip=*" to your ControllerLauncher.controller_args.
Be sure to read our security docs before instructing your controller to listen on
a public interface.
2023-01-06 14:01:35.690 [IPClusterStart] ERROR | Engine output:
Invalid MIT-MAGIC-COOKIE-1 key
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 12
slots that were requested by the application:
/****/venv/bin/python
Either request fewer slots for your application, or make more slots
available for use.
A "slot" is the Open MPI term for an allocatable unit where we can
launch a process. The number of slots available are defined by the
environment in which Open MPI processes are run:
1. Hostfile, via "slots=N" clauses (N defaults to number of
processor cores if not provided)
2. The --host command line parameter, via a ":N" suffix on the
hostname (N defaults to 1 if not provided)
3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
4. If none of a hostfile, the --host command line parameter, or an
RM is present, Open MPI defaults to the number of processor cores
In all the above cases, if you want Open MPI to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.
Alternatively, you can use the --oversubscribe option to ignore the
number of available slots when deciding the number of processes to
launch.
--------------------------------------------------------------------------
2023-01-06 14:01:35.690 [IPClusterStart] ERROR | IPython cluster: stopping
2023-01-06 14:01:35.691 [IPClusterStart] Stopping controller
2023-01-06 14:01:35.691 [IPController] CRITICAL | Received signal 15, shutting down
2023-01-06 14:01:35.692 [IPController] CRITICAL | terminating children...
2023-01-06 14:01:35.816 [IPClusterStart] Controller stopped: {'exit_code': 0, 'pid': 187624, 'identifier': 'ipcontroller-187622'}
2023-01-06 14:01:35.816 [IPClusterStart] Stopping engine(s): 1673010095
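One approach worth trying, sketched under the assumption that ipyparallel's MPIEngineSetLauncher exposes a configurable mpi_args list and that the profile lives at its default path: add Open MPI's --use-hwthread-cpus to the launcher's arguments so the slot count follows hardware threads rather than cores:
# append the launcher option to the mpi profile's config (path assumed)
cat >> ~/.ipython/profile_mpi/ipcluster_config.py <<'EOF'
c.MPIEngineSetLauncher.mpi_args = ['--use-hwthread-cpus']
EOF
# then start with one engine per hardware thread
ipcluster start --engines=MPI -n 12 --profile=mpi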

OpenMPI does not recognize multiple nodes?

I am trying to run a Julia script in parallel on a cluster.
The cluster uses Moab and Torque for the scheduler and resource manager.
Since SSH seems to be restricted, I use MPI for multiprocessing.
I submit the following job, requesting 3 nodes:
#!/bin/bash
#PBS -l walltime=1:00:00
#PBS -l pmem=10gb
#PBS -l nodes=3:ppn=1
#PBS -j oe
#PBS -A open
#PBS -o (some path)
#PBS -e (some path)
cd (some path)
echo ""
echo "JOB Started on $(hostname -s) at $(date)"
echo ""
module purge
module use (some path)/modules
module load julia
module load openmpi
mpirun -np 3 -display-allocation julia --project=. "(some path)/test.jl"
echo ""
echo "JOB ended at $(date)"
But if I look at the output, it seems that it recognizes only one node, comp-bc-0384:
JOB Started on comp-bc-0384 at Sat Mar 19 22:05:12 EDT 2022
====================== ALLOCATED NODES ======================
comp-bc-0384: slots=24 max_slots=0 slots_inuse=0 state=UP
=================================================================
--------------------------------------------------------------------------
[[12308,1],2]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:
Module: OpenFabrics (openib)
Host: comp-bc-0384
Another transport will be used instead, although this may result in
lower performance.
NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
[comp-bc-0384.acib.production.int.aci.ics.psu.edu:10656] 2 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
[comp-bc-0384.acib.production.int.aci.ics.psu.edu:10656] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
10.214858 seconds (116.21 k allocations: 6.110 MiB)
JOB ended at Sat Mar 19 22:05:36 EDT 2022
I was expecting the ALLOCATED NODES section to display the other node(s) I was assigned to.
A similar question in the past (openMPI/mpich2 doesn't run on multiple nodes) suggests that it has something to do with the host file.
Therefore I also tried mpirun -hostfile $PBS_NODEFILE -np 3 -display-allocation julia --project=. "(some path)/test.jl". It then returns the following:
JOB Started on comp-bc-0384 at Sat Mar 19 22:16:15 EDT 2022
Host key verification failed.
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------
JOB ended at Sat Mar 19 22:16:16 EDT 2022
What could be the cause here?
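A hedged line of investigation: both symptoms are consistent with an Open MPI build that lacks Torque (tm) support. Without it, mpirun only sees the local node unless given a hostfile, and with a hostfile it falls back to launching daemons over ssh, which is where "Host key verification failed" would come from on a cluster that restricts SSH. Two quick checks:
# was this Open MPI built with Torque support? look for 'MCA plm: tm' / 'MCA ras: tm'
ompi_info | grep tm
# did Torque actually grant 3 nodes?
cat "$PBS_NODEFILE"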

Airflow simple tasks failing without logs with small parallelism LocalExecutor (was working with SequentialExecutor)

An Airflow (v1.10.5) DAG that ran fine with the SequentialExecutor now has many (though not all) simple tasks failing without any log information when run with the LocalExecutor and minimal parallelism, e.g.:
<airflow.cfg>
# overall task concurrency limit for airflow
parallelism = 8 # which is same as number of cores shown by lscpu
# max tasks per dag
dag_concurrency = 2
# max instances of a given dag that can run on airflow
max_active_runs_per_dag = 1
# max threads used per worker / core
max_threads = 2
# 40G of RAM available total
# CPUs: 8 (sockets 4, cores per socket 4)
see https://www.astronomer.io/guides/airflow-scaling-workers/
Looking at the airflow-webserver.* logs nothing looks out of the ordinary, but looking at airflow-scheduler.out I see...
[airflow@airflowetl airflow]$ tail -n 20 airflow-scheduler.out
....
[2019-12-18 11:29:17,773] {scheduler_job.py:1283} INFO - Executor reports execution of mydag.task_level1_table1 execution_date=2019-12-18 21:21:48.424900+00:00 exited with status failed for try_number 1
[2019-12-18 11:29:17,779] {scheduler_job.py:1283} INFO - Executor reports execution of mydag.task_level1_table2 execution_date=2019-12-18 21:21:48.424900+00:00 exited with status failed for try_number 1
[2019-12-18 11:29:17,782] {scheduler_job.py:1283} INFO - Executor reports execution of mydag.task_level1_table3 execution_date=2019-12-18 21:21:48.424900+00:00 exited with status failed for try_number 1
[2019-12-18 11:29:18,833] {scheduler_job.py:832} WARNING - Set 1 task instances to state=None as their associated DagRun was not in RUNNING state
[2019-12-18 11:29:18,844] {scheduler_job.py:1283} INFO - Executor reports execution of mydag.task_level1_table4 execution_date=2019-12-18 21:21:48.424900+00:00 exited with status success for try_number 1
....
but I'm not really sure what to take away from this.
Does anyone know what could be going on here, or how to get more helpful debugging info?
Looking again at my lscpu specs, I noticed...
[airflow@airflowetl airflow]$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 2
Notice Thread(s) per core: 1
Looking at my airflow.cfg settings, I see max_threads = 2. Setting max_threads = 1 and restarting the scheduler seems to have fixed the problem.
If anyone knows more about what exactly is going wrong under the hood (e.g. why the task fails rather than just waiting for another thread to become available), I would be interested to hear about it.
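A minimal sketch of the fix described above, assuming Airflow 1.10.x defaults and the exact max_threads line shown in the config excerpt:
# drop the scheduler's thread count to match the 1 thread per core that lscpu reports
sed -i 's/^max_threads = 2/max_threads = 1/' "$AIRFLOW_HOME/airflow.cfg"
# then restart the scheduler so it picks up the change
airflow scheduler -D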

doRedis return errors in windows 8 x64 for different Redis server build

I am using the Redis server from this link:
http://cloud.github.com/downloads/rgl/redis/redis-2.4.6-setup-64-bit.exe
with R version 3.0.3, doRedis 1.1.0, and rredis 1.6.8.
The Redis worker ends immediately after receiving jobs:
> redisWorker('jobs')
Waiting for doRedis jobs.
Processing task for job 2 from queue jobs
Error in doTryCatch(return(expr), name, parentenv, handler) :
ERR unknown command 'EVAL'
But with the Redis server from this link:
https://github.com/MSOpenTech/redis
or with a Redis server built from source on Cygwin, the worker seems to be able to process jobs, but the master receives an error:
> redisWorker('jobs')
Waiting for doRedis jobs.
Processing task for job 9 from queue jobs
Processing task 1 ... from queue jobs jobID 9
Processing task for job 9 from queue jobs
Processing task 2 ... from queue jobs jobID 9
Processing task for job 9 from queue jobs
Processing task 3 ... from queue jobs jobID 9
> registerDoRedis('jobs')
> foreach(i = 1:3)%dopar%i
Error in i : task 1 failed - "object '.doRedisGlobals' not found"
I reported this issue to Bryan Lewis, the author of the doRedis and rredis packages. He replied that he is working to resolve the problem and will update the package on CRAN when it is fixed. In the meantime, you can downgrade to doRedis version 1.0.5, which doesn't have this problem.
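A note on the first error, with a quick check (the version threshold is a documented Redis fact; the redis-cli call assumes a locally reachable server): EVAL, the server-side Lua command the worker is issuing, was only added in Redis 2.6, so the 2.4.6 Windows build above cannot run it:
# confirm the server is new enough for EVAL (needs >= 2.6)
redis-cli INFO server | grep redis_version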

Can only run uwsgi with root

I'm preparing to use nginx/uWSGI with Flask for a website I'm developing, but I'm running into problems. NB: the website itself runs great using Flask's debug server on port 5000, but I want to go into production now. To explain what I did:
It's a Linode Ubuntu 12.04 LTS server, and I installed everything like this:
# install nginx
sudo apt-get install python-software-properties
sudo add-apt-repository ppa:nginx/stable
sudo apt-get update
sudo apt-get upgrade --show-upgraded
sudo apt-get install nginx-full
# installing uwsgi
sudo apt-get install build-essential python-dev libxml2-dev
sudo apt-get install libc6 libexpat1 libgd2-xpm libgeoip1 libpam0g libpcre3 libssl1.0.0 libxml2 libxslt1.1 zlib1g
sudo pip install uwsgi
# python basics
sudo apt-get install python-pip build-essential python-dev
sudo pip install virtualenv
sudo pip install virtualenvwrapper
sudo mkdir -p /srv/www/li/
cd /srv/www/li/
virtualenv venv
source /srv/www/li/venv/bin/activate
pip install flask
Then I set out to configure everything, but I already ran into trouble with uWSGI (never mind nginx, which will be the next step):
sudo nano /etc/uwsgi/apps-available/li.xml
<uwsgi>
<plugin>python</plugin>
<socket>/run/uwsgi/app/li.socket</socket>
<chmod-socket>666</chmod-socket>
<chdir>/srv/www/li</chdir>
<pythonpath>/srv/www/li</pythonpath>
<virtualenv>/srv/www/li/venv</virtualenv>
<module>li</module>
<wsgi-file>/srv/www/li/li.py</wsgi-file>
<callable>app</callable>
<master/>
<processes>4</processes>
<harakiri>60</harakiri>
<reload-mercy>8</reload-mercy>
<cpu-affinity>1</cpu-affinity>
<stats>/tmp/stats.socket</stats>
<max-requests>2000</max-requests>
<limit-as>512</limit-as>
<reload-on-as>256</reload-on-as>
<reload-on-rss>192</reload-on-rss>
<no-orphans/>
<vacuum/>
</uwsgi>
sudo ln -s /etc/uwsgi/apps-available/li.xml /etc/uwsgi/apps-enabled/li.xml
However, if I run it, I get:
uwsgi --xml /etc/uwsgi/apps-enabled/li.xml
[uWSGI] parsing config file /etc/uwsgi/apps-enabled/li.xml
open("./python_plugin.so"): No such file or directory [core/utils.c line 4755]
!!! UNABLE to load uWSGI plugin: ./python_plugin.so: cannot open shared object file: No such file or directory !!!
*** Starting uWSGI 1.4.6 (64bit) on [Thu Feb 28 16:30:53 2013] ***
compiled with version: 4.6.3 on 28 February 2013 12:38:22
os: Linux-3.7.10-x86_64-linode30 #1 SMP Wed Feb 27 14:29:31 EST 2013
nodename: demo
machine: x86_64
clock source: unix
detected number of CPU cores: 4
current working directory: /run/uwsgi/app
detected binary path: /usr/local/bin/uwsgi
your processes number limit is 63594
limiting address space of processes...
your process address space limit is 536870912 bytes (512 MB)
your memory page size is 4096 bytes
*** WARNING: you have enabled harakiri without post buffering. Slow upload could be rejected on post-unbuffered webservers ***
detected max file descriptor number: 1024
lock engine: pthread robust mutexes
uwsgi socket 0 bound to UNIX address /run/uwsgi/app/li.socket fd 3
Python version: 2.7.3 (default, Aug 1 2012, 05:25:23) [GCC 4.6.3]
Set PythonHome to /srv/www/li/venv
*** Python threads support is disabled. You can enable it with --enable-threads ***
Python main interpreter initialized at 0xa86e20
your server socket listen backlog is limited to 100 connections
mapped 362120 bytes (353 KB) for 4 cores
*** Operational MODE: preforking ***
added /srv/www/li/ to pythonpath.
/srv/www/li/venv/local/lib/python2.7/site-packages/mongoengine/fields.py:744: FutureWarning: ReferenceFields will default to using ObjectId strings in 0.8, set DBRef=True if this isn't desired
warnings.warn(msg, FutureWarning)
WSGI app 0 (mountpoint='') ready in 1 seconds on interpreter 0xa86e20 pid: 14934 (default app)
*** uWSGI is running in multiple interpreter mode ***
spawned uWSGI master process (pid: 14934)
spawned uWSGI worker 1 (pid: 14940, cores: 1)
mapping worker 1 to CPUs: 0
spawned uWSGI worker 2 (pid: 14941, cores: 1)
mapping worker 2 to CPUs: 1
spawned uWSGI worker 3 (pid: 14942, cores: 1)
mapping worker 3 to CPUs: 2
spawned uWSGI worker 4 (pid: 14943, cores: 1)
unlink(): Operation not permitted [core/socket.c line 109]
bind(): Address already in use [core/socket.c line 141]
...brutally killing workers...
mapping worker 4 to CPUs: 3
VACUUM: unix socket /run/uwsgi/app/li.socket removed.
So I get the unlink "Operation not permitted" and bind "Address already in use" errors (besides the python_plugin error, which I also have no clue how to solve!). If I run it as sudo, it seems to work fine:
sudo uwsgi --xml /etc/uwsgi/apps-enabled/li.xml
[uWSGI] parsing config file /etc/uwsgi/apps-enabled/li.xml
open("./python_plugin.so"): No such file or directory [core/utils.c line 4755]
!!! UNABLE to load uWSGI plugin: ./python_plugin.so: cannot open shared object file: No such file or directory !!!
*** Starting uWSGI 1.4.6 (64bit) on [Thu Feb 28 15:47:41 2013] ***
compiled with version: 4.6.3 on 28 February 2013 12:38:22
os: Linux-3.7.10-x86_64-linode30 #1 SMP Wed Feb 27 14:29:31 EST 2013
nodename: demo
machine: x86_64
clock source: unix
detected number of CPU cores: 4
current working directory: /run/uwsgi
detected binary path: /usr/local/bin/uwsgi
uWSGI running as root, you can use --uid/--gid/--chroot options
*** WARNING: you are running uWSGI as root !!! (use the --uid flag) ***
your processes number limit is 63594
limiting address space of processes...
your process address space limit is 536870912 bytes (512 MB)
your memory page size is 4096 bytes
*** WARNING: you have enabled harakiri without post buffering. Slow upload could be rejected on post-unbuffered webservers ***
detected max file descriptor number: 1024
lock engine: pthread robust mutexes
uwsgi socket 0 bound to UNIX address /run/uwsgi/app/li.socket fd 3
Python version: 2.7.3 (default, Aug 1 2012, 05:25:23) [GCC 4.6.3]
Set PythonHome to /srv/www/li/venv
*** Python threads support is disabled. You can enable it with --enable-threads ***
Python main interpreter initialized at 0x1fc9d00
your server socket listen backlog is limited to 100 connections
mapped 362120 bytes (353 KB) for 4 cores
*** Operational MODE: preforking ***
added /srv/www/li/ to pythonpath.
/srv/www/li/venv/local/lib/python2.7/site-packages/mongoengine/fields.py:744: FutureWarning: ReferenceFields will default to using ObjectId strings in 0.8, set DBRef=True if this isn't desired
warnings.warn(msg, FutureWarning)
WSGI app 0 (mountpoint='') ready in 0 seconds on interpreter 0x1fc9d00 pid: 14755 (default app)
*** uWSGI is running in multiple interpreter mode ***
spawned uWSGI master process (pid: 14755)
spawned uWSGI worker 1 (pid: 14761, cores: 1)
mapping worker 1 to CPUs: 0
spawned uWSGI worker 2 (pid: 14762, cores: 1)
mapping worker 2 to CPUs: 1
spawned uWSGI worker 3 (pid: 14763, cores: 1)
mapping worker 3 to CPUs: 2
spawned uWSGI worker 4 (pid: 14764, cores: 1)
*** Stats server enabled on /tmp/stats.socket fd: 16 ***
mapping worker 4 to CPUs: 3
Can anyone please help me? Since www-data is in the www-data group and runs the process, I tried some things:
sudo usermod -a -G www-data $USER
sudo chown -R $USER:www-data /srv/www/li
sudo chmod -R g+r+w+x /srv/www/li
sudo chown -R $USER:www-data /etc/uwsgi/apps-enabled
sudo chmod -R g+r+w+x /etc/uwsgi/apps-enabled
sudo chown -R $USER:www-data /run/uwsgi/app
sudo chmod -R g+r+w+x /run/uwsgi/app
But that really didn't help either. I also tried a TCP port instead of the Unix socket in /run/uwsgi/app/; that didn't make any difference either.
This is driving me crazy :( I hope someone has a clue about what's happening here.
Kind regards,
Carst
Edit: after a server restart it still gives an error, but a different one:
geoadmin#demo:~$ uwsgi --xml /etc/uwsgi/apps-enabled/li.xml
[uWSGI] parsing config file /etc/uwsgi/apps-enabled/li.xml
*** Starting uWSGI 1.4.6 (64bit) on [Thu Feb 28 18:47:36 2013] ***
compiled with version: 4.6.3 on 28 February 2013 12:38:22
os: Linux-3.7.10-x86_64-linode30 #1 SMP Wed Feb 27 14:29:31 EST 2013
nodename: demo
machine: x86_64
clock source: unix
detected number of CPU cores: 4
current working directory: /home/geoadmin
detected binary path: /usr/local/bin/uwsgi
your processes number limit is 63594
limiting address space of processes...
your process address space limit is 536870912 bytes (512 MB)
your memory page size is 4096 bytes
*** WARNING: you have enabled harakiri without post buffering. Slow upload could be rejected on post-unbuffered webservers ***
detected max file descriptor number: 1024
lock engine: pthread robust mutexes
bind(): No such file or directory [core/socket.c line 141]
This was consistently my #1 result on Google, and this page was relatively unhelpful to me, so I'm going to add my answer, even though it's fairly obvious in retrospect.
My problem was a permissions problem with my stats socket. If you change your uWSGI config's uid or gid parameters, make sure you either chmod or rm all of your old sockets/pids, and their parent folders.
In my case I was trying to place the .sock file in the /vagrant directory, which is a VirtualBox shared folder and is not good for much more than plain reads and writes.
Place the .sock file outside of a VirtualBox mount point, preferably in /tmp; the FHS says /var/run.
Ref:
https://stackoverflow.com/a/7580524/1695680
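A minimal sketch of that cleanup, assuming the socket, pid, and stats paths used in this question (adjust them to your own config):
# remove stale sockets/pids left over from a previous (root-started) run
sudo rm -f /run/uwsgi/app/li.socket /tmp/stats.socket /tmp/li.pid
# make the socket directory writable by the user uWSGI runs as (www-data per the config below)
sudo chown -R www-data:www-data /run/uwsgi/app
uwsgi --xml /etc/uwsgi/apps-enabled/li.xml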
OK, after the later edit I checked the directories and the socket directory didn't exist (anymore); I think it had to do with the original apt-get install versus my later pip install. I still have the issue with the Python plugin, but I will check whether it's necessary for nginx or whether it will work without it... 8 hours of work over a reset, d'oh ;)
@bearrito:
In the end I put the socket in /tmp to avoid permission issues:
<uwsgi>
<uid>www-data</uid>
<gid>www-data</gid>
<plugin>python</plugin>
<socket>/tmp/li.socket</socket>
<chmod-socket>666</chmod-socket>
<chdir>/srv/www/li</chdir>
<pythonpath>/srv/www/li</pythonpath>
<virtualenv>/srv/www/li/venv</virtualenv>
<module>li</module>
<wsgi-file>/srv/www/li/li.py</wsgi-file>
<callable>app</callable>
<master/>
<processes>2</processes>
<pidfile>/tmp/li.pid</pidfile>
<harakiri>120</harakiri>
<reload-mercy>8</reload-mercy>
<cpu-affinity>1</cpu-affinity>
<stats>/tmp/stats.socket</stats>
<max-requests>2000</max-requests>
<limit-as>2048</limit-as>
<reload-on-as>2048</reload-on-as>
<reload-on-rss>1024</reload-on-rss>
<no-orphans/>
<vacuum/>
</uwsgi>
I hope this helps!
For me the solution was to remove /var/run/uwsgi/.sock and run:
chmod 775 /var/run/uwsgi
chmod 777 /var/log/uwsgi
or wherever your uwsgi files are.
