Problems executing MPI with a machinefile on Ubuntu 18.04

Following these guidelines (MpichClusterUbuntu), I'm trying to execute my very first MPI program on a PC with Ubuntu 18.04.01 Server Edition and a laptop with Ubuntu 18.04.02 Desktop. Up to step 11 of the guide, everything went fine, with no problems at all.
I set up a machinefile called hosts
with these two lines:
192.168.1.7 # first 'master' node: the PC
192.168.1.5 # second node: the laptop
This is the very simple example file contained in the guidelines, which compiled without problems:
#include <stdio.h>
#include <mpi.h>
int main(int argc, char** argv) {
int myrank, nprocs;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
printf("Hello from processor %d of %d\n", myrank, nprocs);
MPI_Finalize();
return 0;
}
mpiu@pc01:~$ mpicc mpi_hello.c -o mpi_hello
Executing without the machinefile 'hosts', this is the output:
mpiu@pc01:~$ mpiexec -n 8 ./mpi_hello
------------------------------------------------------------------
[[27419,1],0]: A high-performance Open MPI point-to-point messaging
module was unable to find any relevant network interfaces:
Module: OpenFabrics (openib)
Host: pc01
Another transport will be used instead, although this may result in
lower performance.
NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
----------------------------------------------------------------
Hello from processor 1 of 8
Hello from processor 2 of 8
Hello from processor 5 of 8
Hello from processor 6 of 8
Hello from processor 0 of 8
Hello from processor 3 of 8
Hello from processor 7 of 8
Hello from processor 4 of 8
[pc01:25010] 7 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
[pc01:25010] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
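That warning is only cosmetic here. Acting on the NOTE above would look like this hedged example (the message is Open MPI's, which matters for the updates below):
mpiexec --mca btl_base_warn_component_unused 0 -n 8 ./mpi_hello
# or equivalently, once per shell session:
export OMPI_MCA_btl_base_warn_component_unused=0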
And when executing with the machinefile 'hosts', the run hangs without producing any output:
mpiu@pc01:~$ mpiexec -n 8 -machinefile hosts ./mpi_hello
PS:
this is the content of /etc/netplan/50-cloud-init.yaml in the "master" node (PC):
network:
    ethernets:
        enp3s0:
            #addresses: []
            #dhcp4: true
            addresses: [192.168.1.7/24]
            gateway4: 192.168.1.1
            nameservers:
                addresses: [8.8.8.8,8.8.4.4]
            dhcp4: no
    version: 2
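For completeness, a netplan change only takes effect once it is applied; a hedged reminder of the usual commands:
sudo netplan try     # test the change; rolls back after a timeout unless confirmed
sudo netplan apply   # apply it permanently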
Updates:
After the correct comment from Gilles, I removed Open MPI, which I guess had been installed previously.
Now, executing step 11 of the guidelines MpichClusterUbuntu18.04:
A) without calling the machinefile:
marco@pc01:/mirror$ mpiexec -n 8 ./mpi_hello
Hello from processor 0 of 8
Hello from processor 1 of 8
Hello from processor 3 of 8
Hello from processor 5 of 8
Hello from processor 6 of 8
Hello from processor 7 of 8
Hello from processor 2 of 8
Hello from processor 4 of 8
B) But calling the machinefile "hosts":
marco@pc01:/mirror$ mpiexec -n 8 -machinefile /home/mpiu/hosts ./mpi_hello
ssh: Could not resolve hostname pc0: Temporary failure in name resolution
ssh: Could not resolve hostname riccarcohp: Temporary failure in name resolution
^C[mpiexec@pc01] Sending Ctrl-C to processes as requested
[mpiexec@pc01] Press Ctrl-C again to force abort
[mpiexec@pc01] HYDU_sock_write (utils/sock/sock.c:286): write error (Bad file descriptor)
[mpiexec@pc01] HYD_pmcd_pmiserv_send_signal (pm/pmiserv/pmiserv_cb.c:177): unable to write data to proxy
[mpiexec@pc01] ui_cmd_cb (pm/pmiserv/pmiserv_pmci.c:79): unable to send signal downstream
[mpiexec@pc01] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@pc01] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
[mpiexec@pc01] main (ui/mpich/mpiexec.c:340): process manager error waiting for completion
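The "Could not resolve hostname" lines mean the names in the machinefile are unknown to the resolver. One hedged fix is to map each name to its IP in /etc/hosts on every node (the names below are illustrative and must match the machinefile exactly); the alternative is to use plain IP addresses, as tried next:
192.168.1.7    pc01
192.168.1.5    riccardohp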
After putting only the IP addresses in the machinefile 'hosts':
mpiu@pc01:/mirror$ mpiexec -n 8 -machinefile /home/mpiu/hosts ./mpi_hello
Permission denied, please try again.
Permission denied, please try again.
mpiu@192.168.1.5: Permission denied (publickey,password).
But I can ssh with no problems at all from the PC to the laptop:
mpiu@pc01:/mirror$ ssh 192.168.1.5
mpiu@192.168.1.5's password:
mpiu@riccardo-HP-Laptop-15-da0xxx:~$
Now it seems SOLVED, even though I repeated, for the third time, exactly the same procedure.
These are the steps I followed for setting up passwordless SSH between pc01 (the PC) and riccardohp (the laptop):
marco@pc01:/$ su - mpiu
Password:
mpiu@pc01:~$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/mpiu/.ssh/id_rsa):
Created directory '/home/mpiu/.ssh'.
To make it simpler, I left out the passphrase:
Your identification has been saved in /home/mpiu/.ssh/id_rsa.
Your public key has been saved in /home/mpiu/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:..... mpiu@pc01
The key's randomart image is:
+---[RSA 2048]----+
...................
...................
+----[SHA256]-----+
I copied the public key from pc01 to the laptop:
mpiu@pc01:~$ ssh-copy-id 192.168.1.5
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/mpiu/.ssh/id_rsa.pub"
The authenticity of host '192.168.1.5 (192.168.1.5)' can't be established.
ECDSA key fingerprint is SHA256:.......................
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
mpiu@192.168.1.5's password:
Number of key(s) added: 1
Now try logging into the machine, with: "ssh '192.168.1.5'"
and check to make sure that only the key(s) you wanted were added.
mpiu@pc01:~$ ssh '192.168.1.5'
Welcome to Ubuntu 18.04.2 LTS (GNU/Linux 4.18.0-16-generic x86_64)
mpiu@riccardo-HP-Laptop-15-da0xxx:~$
So, apparently, the SSH connection between pc01 and the laptop works fine...
mpiu@riccardo-HP-Laptop-15-da0xxx:~$ ^C
mpiu@riccardo-HP-Laptop-15-da0xxx:~$ logout
Connection to 192.168.1.5 closed.
mpiu@pc01:~$ cd /
mpiu@pc01:/$ cd mirror
mpiu@pc01:/mirror$ mpicc mpi_hello.c -o mpi_hello
gcc: error: mpi_hello.c: No such file or directory
mpiu@pc01:/mirror$ nano mpi_hello.c
mpiu@pc01:/mirror$ mpicc mpi_hello.c -o mpi_hello
mpiu@pc01:/mirror$ mpiexec -n 8 ./mpi_hello
Hello from processor 0 of 8
Hello from processor 1 of 8
Hello from processor 2 of 8
Hello from processor 3 of 8
Hello from processor 4 of 8
Hello from processor 5 of 8
Hello from processor 6 of 8
Hello from processor 7 of 8
I put the following in the file hosts in /mirror:
192.168.1.7
192.168.1.5
mpiu@pc01:/mirror$ mpiexec -n 8 -machinefile hosts ./mpi_hello
Hello from processor 2 of 8
Hello from processor 4 of 8
Hello from processor 6 of 8
Hello from processor 0 of 8
Hello from processor 1 of 8
Hello from processor 3 of 8
Hello from processor 5 of 8
Hello from processor 7 of 8
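To double-check that ranks really land on both machines (the plain hello output does not show the host), a hedged variant of the example program that also prints the processor name:
#include <stdio.h>
#include <mpi.h>
int main(int argc, char** argv) {
    int myrank, nprocs, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Get_processor_name(host, &len);   /* name of the node this rank runs on */
    printf("Hello from processor %d of %d on %s\n", myrank, nprocs, host);
    MPI_Finalize();
    return 0;
}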
Marco

Related

OpenMPI does not recognize multiple nodes?

I am trying to run a Julia script in parallel on a cluster.
The cluster uses Moab and Torque for the scheduler and resource manager.
Since SSH seems to be restricted, I use MPI for multiprocessing.
I submit the following job, requesting 3 nodes:
#!/bin/bash
#PBS -l walltime=1:00:00
#PBS -l pmem=10gb
#PBS -l nodes=3:ppn=1
#PBS -j oe
#PBS -A open
#PBS -o (some path)
#PBS -e (some path)
cd (some path)
echo ""
echo "JOB Started on $(hostname -s) at $(date)"
echo ""
module purge
module use (some path)/modules
module load julia
module load openmpi
mpirun -np 3 -display-allocation julia --project=. "(some path)/test.jl"
echo ""
echo "JOB ended at $(date)"
But if I look at the output, it seems that it recognizes only one node, comp-bc-0384:
JOB Started on comp-bc-0384 at Sat Mar 19 22:05:12 EDT 2022
====================== ALLOCATED NODES ======================
comp-bc-0384: slots=24 max_slots=0 slots_inuse=0 state=UP
=================================================================
--------------------------------------------------------------------------
[[12308,1],2]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:
Module: OpenFabrics (openib)
Host: comp-bc-0384
Another transport will be used instead, although this may result in
lower performance.
NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
[comp-bc-0384.acib.production.int.aci.ics.psu.edu:10656] 2 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
[comp-bc-0384.acib.production.int.aci.ics.psu.edu:10656] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
10.214858 seconds (116.21 k allocations: 6.110 MiB)
JOB ended at Sat Mar 19 22:05:36 EDT 2022
I was expecting the ALLOCATED NODES section to display the other node(s) I was assigned to.
A similar question in the past (openMPI/mpich2 doesn't run on multiple nodes) suggests that it has something to do with the host file.
Therefore I also tried mpirun -hostfile $PBS_NODEFILE -np 3 -display-allocation julia --project=. "(some path)/test.jl". It then returns the following:
JOB Started on comp-bc-0384 at Sat Mar 19 22:16:15 EDT 2022
Host key verification failed.
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------
JOB ended at Sat Mar 19 22:16:16 EDT 2022
What could be the cause here?
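Given the "Host key verification failed." line, one hedged guess is that the other allocated nodes' SSH host keys are not yet in known_hosts, so the ORTE daemons cannot be launched there. A sketch of a pre-seeding step that could go in the job script before mpirun, assuming ssh-keyscan is available and node-to-node SSH is actually permitted on this cluster:
# hypothetical pre-step: trust the host keys of all allocated nodes
ssh-keyscan -H $(sort -u "$PBS_NODEFILE") >> ~/.ssh/known_hosts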

I can't mount CephFS on my computer. How can I solve this problem?

I have a CephFS file system and I need to mount it.
I have two pools, cephfs_data and cephfs_meta.
The ceph -s output is:
cluster:
    id:      9f3e7f80-4515-4b5f-92f0-4eb49f3cbf44
    health:  HEALTH_OK
services:
    mon: 2 daemons, quorum mon1,osd0
    mgr: osd0(active), standbys: mon1
    mds: mycephfs-1/1/1 up {0=mon1=up:active}
    osd: 1 osds: 1 up, 1 in
data:
    pools:   3 pools, 72 pgs
    objects: 24 objects, 35 KiB
    usage:   1.1 GiB used, 837 GiB / 838 GiB avail
    pgs:     72 active+clean
I created a user with these properties:
[client.foo]
key = AQA4d5xdlAklBxAA+Q5T+b3HLAxj2kRKzXUOSA==
caps mds = "allow r"
caps mon = "allow r"
caps osd = "allow rw tag cephfs data=mycephfs"
And when I try to run this command:
sudo mount -t fuse.ceph conf=/etc/ceph/ceph.conf /mnt/cephfs/
this happens:
mount: /mnt/cephfs: wrong fs type, bad option, bad superblock on conf=/etc/ceph/ceph.conf, missing codepage or helper program, or other error.
or
when I try to run this command:
sudo mount.ceph mon1:6789:/ /mnt/cephfs/
this happens:
mount error 110 = Connection timed out
or
when I try to run this command:
sudo ceph-fuse -n client.foo /mnt/cephfs/
this happens:
ceph-fuse[64711]: starting ceph client
2019-10-21 16:21:17.329932 7f58cedbb500 -1 init, newargv = 0x55a6c11f0340 newargc=9
and then it hangs indefinitely. I can't see "starting fuse".
Where is my mistake? Which way should I follow?
The syntax of your commands is incorrect.
You can mount the CephFS using
mount -t ceph mon1:6789:/ /mnt/ceph -o name=foo,secretfile=/path/to/keyring/file
There are many options you can use for the mount; they can be found in the mount.ceph documentation.
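For the secretfile option, a hedged sketch of preparing such a file (paths are illustrative, and the file must contain only the base64 key, not the full keyring):
# extract just the key for client.foo and point the mount at it
sudo ceph auth get-key client.foo | sudo tee /etc/ceph/client.foo.secret
sudo chmod 600 /etc/ceph/client.foo.secret
sudo mount -t ceph mon1:6789:/ /mnt/cephfs -o name=foo,secretfile=/etc/ceph/client.foo.secret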

Raspberry Pi sim900 Default Internet Access

I have set up an ITEAD SIM900 GSM module to interface with a Raspberry Pi. I believe I have established a GPRS connection to AT&T through wvdial, as I get these results.
--> WvDial: Internet dialer version 1.61
--> Initializing modem.
--> Sending: AT+CGDCONT=1,"IP","Broadband"
AT+CGDCONT=1,"IP","Broadband"
OK
--> Modem initialized.
--> Sending: ATDT*99#
--> Waiting for carrier.
ATDT*99#
CONNECT
--> Carrier detected. Starting PPP immediately.
--> Starting pppd at Thu Aug 14 05:49:20 2014
--> Pid of pppd: 2794
I have been looking all over the internet for some answers to a few questions that I have, but I can't seem to find any. Any help with the following questions will be greatly appreciated! Thanks!
I have three questions, and some may be stupid as I am VERY new to this field.
Am I actually connected to AT&T's GPRS network?
How can I make this module (serial port /dev/ttyAMA0) my default internet connection? What I mean is that I want all internet traffic routed through this modem (web surfing, email, etc.). I am connected to the Raspberry Pi via SSH, so I have to have either Ethernet or Wi-Fi active to access the computer; I am currently using Ethernet. After I connect through wvdial in the way shown above and disable all other internet sources, I have no access. It seems to still be looking to the active Ethernet port for data (I could be wrong).
For my project I need to have the sim900 modem as the internet access point, but I also need to be able to connect to a LAN via wifi that has no internet access. Is this possible?
Finally I got the (Raspberry Pi + ppp + GPRS/GSM modem) setup working.
Some notes before starting:
Make sure the power supply you use for the Raspberry Pi is exactly 5 V and can provide at least 2 A without voltage drop-out. The SIM900 power source must be 3.3 V, 2 A.
Set the SIM900 baud rate to 115200 via: AT+IPR=115200
Check the modem serial peripheral via: $ screen /dev/ttyAMA0 115200, then type AT<enter> and it will echo OK. Hit Ctrl+a k y to exit.
/etc/ppp/options-mobile
ttyAMA0
115200
lock
crtscts
modem
passive
novj
defaultroute
replacedefaultroute
noipdefault
usepeerdns
noauth
hide-password
persist
holdoff 10
maxfail 0
debug
Create the /etc/ppp/peers directory:
$ mkdir /etc/ppp/peers
$ cd /etc/ppp/peers
/etc/ppp/peers/mobile-auth
file /etc/ppp/options-mobile
user "your_usr"
password "your_pass"
connect "/usr/sbin/chat -v -t15 -f /etc/ppp/chatscripts/mobile-modem.chat"
/etc/ppp/peers/mobile-noauth
file /etc/ppp/options-mobile
connect "/usr/sbin/chat -v -t15 -f /etc/ppp/chatscripts/mobile-modem.chat"
Create the /etc/ppp/chatscripts directory:
$ mkdir /etc/ppp/chatscripts
/etc/ppp/chatscripts/mobile-modem.chat
ABORT 'BUSY'
ABORT 'NO CARRIER'
ABORT 'VOICE'
ABORT 'NO DIALTONE'
ABORT 'NO DIAL TONE'
ABORT 'NO ANSWER'
ABORT 'DELAYED'
REPORT CONNECT
TIMEOUT 6
'' 'ATQ0'
'OK-AT-OK' 'ATZ'
TIMEOUT 3
'OK' #/etc/ppp/chatscripts/pin
'OK\d-AT-OK' 'ATI'
'OK' 'ATZ'
'OK' 'ATQ0 V1 E1 S0=0 &C1 &D2 +FCLASS=0'
'OK' #/etc/ppp/chatscripts/mode
'OK-AT-OK' #/etc/ppp/chatscripts/apn
'OK' 'ATDT*99***1#'
TIMEOUT 30
CONNECT ''
/etc/ppp/chatscripts/my-operator-apn
AT+CGDCONT=1,"IP","<apn-name>"
/etc/ppp/chatscripts/pin.CODE
AT+CPIN=1234
/etc/ppp/chatscripts/pin.NONE
AT
/etc/ppp/chatscripts/mode.3G-only
AT\^SYSCFG=14,2,3fffffff,0,1
/etc/ppp/chatscripts/mode.3G-pref
AT\^SYSCFG=2,2,3fffffff,0,1
/etc/ppp/chatscripts/mode.GPRS-only
AT\^SYSCFG=13,1,3fffffff,0,0
/etc/ppp/chatscripts/mode.GPRS-pref
AT\^SYSCFG=2,1,3fffffff,0,0
The SYSCFG line in the mode.* files is device-dependent, and likely Huawei-specific, so you may use the mode.NONE file if your modem is a SIM900.
/etc/ppp/chatscripts/mode.NONE
AT
Make some symbolic links:
$ ln -s /etc/ppp/chatscripts/my-operator-apn /etc/ppp/chatscripts/apn
$ ln -s /etc/ppp/chatscripts/mode.NONE /etc/ppp/chatscripts/mode
$ ln -s /etc/ppp/chatscripts/pin.NONE /etc/ppp/chatscripts/pin
If you have to enter credentials, use mobile-auth instead of mobile-noauth below:
$ mv provider provider.example
$ ln -s /etc/ppp/peers/mobile-noauth /etc/ppp/peers/provider
Check syslog in another console:
$ tail -f /var/log/syslog | grep -Ei 'pppd|chat'
Finally issue the pon command to see the result:
$ pon
The base instructions: https://wiki.archlinux.org/index.php/3G_and_GPRS_modems_with_pppd
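A hedged way to confirm that the PPP link really became the default route once pon is up (the interface name ppp0 is an assumption), with poff as the counterpart command to drop the link:
$ ip addr show ppp0        # should show the address assigned by the carrier
$ ip route show default    # the default route should now point at ppp0
$ poff                     # disconnects again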

MPI (OpenMPI) - MPI_Publish_name cannot contact global ompi-server and throws error

I am attempting to write an MPI application that would consist of programs in the server-client mould. I am stuck trying to get the server to publish its name to the ompi-server in the global scope.
Here is the server code:
int main(int argc, char** argv) {
int myrank, nprocs, errmpi;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
char port_name[MPI_MAX_PORT_NAME];
MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "ompi_global_scope", "yes");
MPI_Open_port(info, port_name);
//Fails here
MPI_Publish_name("ServerName", info, port_name);
// Rest of code...
I get the following error on running it:
$ ./mpi/bin/mpirun -np 1 --mca btl self ServerName
--------------------------------------------------------------------------
Process rank 0 attempted to publish to a global ompi_server that
could not be contacted. This is typically caused by either not
specifying the contact info for the server, or by the server not
currently executing. If you did specify the contact info for a
server, please check to see that the server is running and start
it again (or have your sys admin start it) if it isn't.
--------------------------------------------------------------------------
[xxx:18205] *** An error occurred in MPI_Publish_name
[xxx:18205] *** reported by process [1424949249,139676631433216]
[xxx:18205] *** on communicator MPI_COMM_WORLD
[xxx:18205] *** MPI_ERR_INTERN: internal error
[xxx:18205] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[xxx:18205] *** and potentially your MPI job)
I do have the ompi-server process running in debug mode in a console:
$ ./ompi-server --no-daemonize -d -r +
[xxx:14140] [[9416,0],0] orte-server: up and running!
Ultimately I will distribute the processes across various nodes, but for now I would really like to get the framework working on a single node. Could someone please help? Thanks very much indeed!
EDIT 1: Thank you very much for your quick reply. I made the following changes:
$mpi/bin/ompi-server --no-daemonize -d -r mpiuri
If I now run the program like this, it hangs at the point where it previously failed:
$./mpi/bin/mpirun --ompi-server file:mpiuri -mca btn tcp,self,sm -np 1 -v Server
While if I run the program with the following:
$ ./mpi/bin/mpirun --ompi-server file:mpiuri -mca btn tcp,self,sm -np 1 -v --wait-for-server --server-wait-time 10 Server
I get the following error:
--------------------------------------------------------------------------
mpirun was instructed to wait for the requested ompi-server, but was unable to
establish contact with the server during the specified wait time:
Server uri: 799801344.0;tcp://192.168.1.113:44487
Timeout time: 10
Error received: Not supported
Please check to ensure that the requested server matches the actual server
information, and that the server is in operation.
--------------------------------------------------------------------------
I must be close... but I can't quite figure it out.
I am fairly sure it is not the firewall, since I added the rule ALLOW 192.168.1.0/24 to ufw.
Here is how to connect with the ompi-server:
1) Ensure that ompi-server is up and running, and is writing its URI to a file, with the following command:
$mpi/bin/ompi-server --no-daemonize -d -r mpiuri
2) Start all the MPI processes with this URI file, ensuring that you prefix the URI filename with "file:" when you enter the --ompi-server parameter.
3) Enter the hostname of the node where you run mpirun, like so:
$./mpi/bin/mpirun --ompi-server file:mpiuri -host myHostName -np 1 -v Server
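For reference, a minimal client-side sketch (an assumption, not taken from the asker's code): the client looks up the published name and connects, must itself be launched with --ompi-server file:mpiuri, and expects the server to be waiting in MPI_Comm_accept on the port it opened:
#include <stdio.h>
#include <mpi.h>
int main(int argc, char** argv) {
    char port_name[MPI_MAX_PORT_NAME];
    MPI_Comm server;
    MPI_Init(&argc, &argv);
    /* look up the port the server published under "ServerName" */
    MPI_Lookup_name("ServerName", MPI_INFO_NULL, port_name);
    /* connect; the server side must be blocked in MPI_Comm_accept */
    MPI_Comm_connect(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF, &server);
    printf("Connected to server at %s\n", port_name);
    MPI_Comm_disconnect(&server);
    MPI_Finalize();
    return 0;
}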

MPICH2 Hydra round robin on multicore

I need to schedule processes in round-robin order in my MPI program.
I have a cluster with 8 nodes, each node with a quad-core processor.
I use mpich2-1.4.1p1 under Ubuntu Linux.
If I use this machinefile:
node01
node02
node03
node04
node05
node06
node07
node08
and then run:
mpiexec -np 10 -machinefile host ./my-program
I get the right scheduling: rank 0 to node01, rank 1 to node02, ..., rank 8 to node01 and finally rank 9 to node02.
But I need to know whether rank 0 and rank 8 run on the same core or not: I need rank 0 to work on the first core of node01 and rank 8 on the second.
If I use a different machinefile:
node01:4
node02:4
node03:4
node04:4
node05:4
node06:4
node07:4
node08:4
and then run:
mpiexec -np 10 -machinefile host2 ./my-program
Then ranks 0, 1, 2 and 3 run on node01, and that isn't what I want.
How do I force Hydra to use round robin across nodes first and then across cores when using this second machinefile?
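This does not change Hydra's mapping, but a hedged way to verify which core each rank actually lands on is to have every rank print its host and current CPU (sched_getcpu() is glibc-specific, and without explicit binding the reported core can change over time):
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <mpi.h>
int main(int argc, char** argv) {
    int rank, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &len);
    /* report where this rank is currently running */
    printf("rank %d: host %s, core %d\n", rank, host, sched_getcpu());
    MPI_Finalize();
    return 0;
}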
