I have two machines with different usernames: assume user1@master and user2@slave. I would like to run an MPI job across the two machines, but I have been unsuccessful so far. I have successfully set up passwordless SSH between the two machines. Both machines have the same version of OpenMPI, and both have PATH and LD_LIBRARY_PATH set accordingly.
The OpenMPI path on each machine is /home/$USER/.openmpi and the program I want to run is inside ~/folder
My /etc/hosts file on both machines:
x.x.x.110 master
x.x.x.111 slave
My ~/.ssh/config file on user1@master:
Host slave
User user2
I then execute the following command on user1@master, from inside ~/folder:
$ mpiexec -n 1 ./program : -np 1 -host slave -wdir /home/user2/folder ./program
I get the following error:
bash: orted: command not found
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------
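The "orted: command not found" line means the non-interactive shell that SSH spawns on slave cannot find Open MPI's daemon. A minimal sketch of two common fixes, assuming the install prefix /home/user2/.openmpi on the slave (adjust the paths to your setup):

```shell
# Option 1: pass the remote install prefix explicitly, so mpiexec exports
# it to the orted it launches on the slave (prefix path is an assumption):
mpiexec --prefix /home/user2/.openmpi -n 1 ./program : \
    -np 1 -host slave -wdir /home/user2/folder ./program

# Option 2: make sure the slave's *non-interactive* shell sets the paths.
# These exports must appear before any "exit if not interactive" guard
# near the top of the slave's ~/.bashrc:
export PATH=/home/user2/.openmpi/bin:$PATH
export LD_LIBRARY_PATH=/home/user2/.openmpi/lib:$LD_LIBRARY_PATH

# Verify what a non-interactive SSH session actually sees:
ssh slave 'which orted'
```

If `ssh slave 'which orted'` prints nothing, the remote shell is not picking up the PATH settings, which matches the error above.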
Edits
If I use a hostfile with contents:
localhost
user2@slave
along with the --mca argument I get the following error:
$ mpirun --mca plm_base_verbose 10 -n 5 --hostfile hosts.txt ./program
[user:29277] mca: base: components_register: registering framework plm components
[user:29277] mca: base: components_register: found loaded component slurm
[user:29277] mca: base: components_register: component slurm register function successful
[user:29277] mca: base: components_register: found loaded component isolated
[user:29277] mca: base: components_register: component isolated has no register or open function
[user:29277] mca: base: components_register: found loaded component rsh
[user:29277] mca: base: components_register: component rsh register function successful
[user:29277] mca: base: components_open: opening plm components
[user:29277] mca: base: components_open: found loaded component slurm
[user:29277] mca: base: components_open: component slurm open function successful
[user:29277] mca: base: components_open: found loaded component isolated
[user:29277] mca: base: components_open: component isolated open function successful
[user:29277] mca: base: components_open: found loaded component rsh
[user:29277] mca: base: components_open: component rsh open function successful
[user:29277] mca:base:select: Auto-selecting plm components
[user:29277] mca:base:select:( plm) Querying component [slurm]
[user:29277] mca:base:select:( plm) Querying component [isolated]
[user:29277] mca:base:select:( plm) Query of component [isolated] set priority to 0
[user:29277] mca:base:select:( plm) Querying component [rsh]
[user:29277] mca:base:select:( plm) Query of component [rsh] set priority to 10
[user:29277] mca:base:select:( plm) Selected component [rsh]
[user:29277] mca: base: close: component slurm closed
[user:29277] mca: base: close: unloading component slurm
[user:29277] mca: base: close: component isolated closed
[user:29277] mca: base: close: unloading component isolated
[user:29277] *** Process received signal ***
[user:29277] Signal: Segmentation fault (11)
[user:29277] Signal code: (128)
[user:29277] Failing at address: (nil)
[user:29277] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3ef20)[0x7f4226242f20]
[user:29277] [ 1] /lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x197)[0x7f422629b207]
[user:29277] [ 2] /lib/x86_64-linux-gnu/libc.so.6(__nss_lookup_function+0x10a)[0x7f422634d06a]
[user:29277] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__nss_lookup+0x3d)[0x7f422634d19d]
[user:29277] [ 4] /lib/x86_64-linux-gnu/libc.so.6(getpwuid_r+0x2f3)[0x7f42262e7ee3]
[user:29277] [ 5] /lib/x86_64-linux-gnu/libc.so.6(getpwuid+0x98)[0x7f42262e7498]
[user:29277] [ 6] /home/.openmpi/lib/openmpi/mca_plm_rsh.so(+0x477d)[0x7f422356977d]
[user:29277] [ 7] /home/.openmpi/lib/openmpi/mca_plm_rsh.so(+0x67a7)[0x7f422356b7a7]
[user:29277] [ 8] /home/.openmpi/lib/libopen-pal.so.40(opal_libevent2022_event_base_loop+0xdc9)[0x7f4226675749]
[user:29277] [ 9] mpirun(+0x1262)[0x563fde915262]
[user:29277] [10] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f4226225b97]
[user:29277] [11] mpirun(+0xe7a)[0x563fde914e7a]
[user:29277] *** End of error message ***
Segmentation fault (core dumped)
I do not get any of the ssh/orted debug output that the verbose option should print - maybe because I am mistyping the --mca command?
Related
After some trials, I was able to install the Rmpi package on my computer using the following command:
R CMD INSTALL -l /storage/home/***/.R Rmpi_0.6-7.tar.gz --configure-args="--with-Rmpi-type=OPENMPI --disable-dlopen --with-Rmpi-include=/gpfs/group/RISE/sw7/openmpi_4.1.4_gcc-9.3.1/include --with-Rmpi-libpath=/gpfs/group/RISE/sw7/openmpi_4.1.4_gcc-9.3.1/lib"
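Before debugging the batch run, it can help to confirm that this Rmpi build loads and talks to the intended Open MPI at all. A quick sanity check, assuming the same modules as below are available (the library path is the one from the install command):

```shell
# Load the toolchain the package was built against, then ask Rmpi for the
# universe size from a single-rank run. A load error here points at a
# link/runtime mismatch rather than at the batch job itself.
module use /gpfs/group/RISE/sw7/modules
module load openmpi/4.1.4-gcc.9.3.1 r/4.0.3
mpirun -np 1 Rscript -e 'library(Rmpi); print(mpi.universe.size()); mpi.quit()'
```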
I tried to run the following test code:
# Load the R MPI package if it is not already loaded.
if (!is.loaded("mpi_initialize")) {
    library("Rmpi")
}
ns <- mpi.universe.size() - 1
mpi.spawn.Rslaves(nslaves=ns)
#
# In case R exits unexpectedly, have it automatically clean up
# resources taken up by Rmpi (slaves, memory, etc...)
.Last <- function() {
    if (is.loaded("mpi_initialize")) {
        if (mpi.comm.size(1) > 0) {
            print("Please use mpi.close.Rslaves() to close slaves.")
            mpi.close.Rslaves()
        }
        print("Please use mpi.quit() to quit R")
        .Call("mpi_finalize")
    }
}
# Tell all slaves to return a message identifying themselves
mpi.bcast.cmd( id <- mpi.comm.rank() )
mpi.bcast.cmd( ns <- mpi.comm.size() )
mpi.bcast.cmd( host <- mpi.get.processor.name() )
mpi.remote.exec(paste("I am",mpi.comm.rank(),"of",mpi.comm.size()))
# Test computations
x <- 5
x <- mpi.remote.exec(rnorm, x)
length(x)
x
# Tell all slaves to close down, and exit the program
mpi.close.Rslaves(dellog = FALSE)
mpi.quit()
On my HPC I run the following:
qsub -A open -l walltime=6:00:00 -l nodes=4:ppn=4:stmem -I
module use /gpfs/group/RISE/sw7/modules
module load openmpi/4.1.4-gcc.9.3.1 r/4.0.3
mpirun -np 4 Rscript "codes/test/test4.R"
But then I get the following error, which indicates that no slaves are available to spawn:
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.
Local host: comp-sc-0222
Local adapter: mlx4_0
Local port: 1
--------------------------------------------------------------------------
(the same warning is printed three more times, once per rank)
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: comp-sc-0222
Local device: mlx4_0
--------------------------------------------------------------------------
(the same warning is printed three more times, once per rank)
Error in mpi.comm.spawn(slave = system.file("Rslaves.sh", package = "Rmpi"), :
Choose a positive number of slaves.
Calls: mpi.spawn.Rslaves -> mpi.comm.spawn
Execution halted
(the same error is printed for each of the four ranks)
I have tried specifying different values for -np, but I still get the same error. What could be the cause here?
============================================================
(EDIT)
It seems that my original command to load the modules also loads intel/19.1.2 and mkl/2020.3. If I unload them, I do see that OMPI_UNIVERSE_SIZE=4.
[****#comp-sc-0220 work]$ module purge
[****#comp-sc-0220 work]$ module load openmpi/4.1.4-gcc.9.3.1 r/4.0.3
[****#comp-sc-0220 work]$ module list
Currently Loaded Modules:
1) openmpi/4.1.4-gcc.9.3.1 2) intel/19.1.2 3) mkl/2020.3 4) r/4.0.3
[****#comp-sc-0220 work]$ mpirun -np 4 env | grep OMPI_UNIVERSE_SIZE
[****#comp-sc-0220 work]$ type mpirun; mpirun --version; mpirun -np 1 env | grep OMPI
mpirun is /opt/aci/intel/compilers_and_libraries_2020.2.254/linux/mpi/intel64/bin/mpirun
Intel(R) MPI Library for Linux* OS, Version 2019 Update 8 Build 20200624 (id: 4f16ad915)
Copyright 2003-2020, Intel Corporation.
LMOD_FAMILY_COMPILER_VERSION=19.1.2
LMOD_FAMILY_COMPILER=intel
[****#comp-sc-0220 work]$ module purge
[****#comp-sc-0220 work]$ module load openmpi/4.1.4-gcc.9.3.1 r/4.0.3
[****#comp-sc-0220 work]$ module unload intel mkl
[****#comp-sc-0220 work]$ module list
Currently Loaded Modules:
1) openmpi/4.1.4-gcc.9.3.1 2) r/4.0.3
[****#comp-sc-0220 work]$ mpirun -np 4 env | grep OMPI_UNIVERSE_SIZE
OMPI_UNIVERSE_SIZE=4
OMPI_UNIVERSE_SIZE=4
OMPI_UNIVERSE_SIZE=4
OMPI_UNIVERSE_SIZE=4
[****#comp-sc-0220 work]$ type mpirun; mpirun --version; mpirun -np 1 env | grep OMPI
mpirun is /gpfs/group/RISE/sw7/openmpi_4.1.4_gcc-9.3.1/bin/mpirun
mpirun (Open MPI) 4.1.4
Report bugs to http://www.open-mpi.org/community/help/
OMPI_MCA_pmix=^s1,s2,cray,isolated
OMPI_COMMAND=env
OMPI_MCA_orte_precondition_transports=954e2ae0a9569e46-2223294369d728a3
OMPI_MCA_orte_local_daemon_uri=4134338560.0;tcp://10.102.201.220:58039
OMPI_MCA_orte_hnp_uri=4134338560.0;tcp://10.102.201.220:58039
OMPI_MCA_mpi_oversubscribe=0
OMPI_MCA_orte_app_num=0
OMPI_UNIVERSE_SIZE=4
OMPI_MCA_orte_num_nodes=1
OMPI_MCA_shmem_RUNTIME_QUERY_hint=mmap
OMPI_MCA_orte_bound_at_launch=1
OMPI_MCA_ess=^singleton
OMPI_MCA_orte_ess_num_procs=1
OMPI_COMM_WORLD_SIZE=1
OMPI_COMM_WORLD_LOCAL_SIZE=1
OMPI_MCA_orte_tmpdir_base=/tmp
OMPI_MCA_orte_top_session_dir=/tmp/ompi.comp-sc-0220.26954
OMPI_MCA_orte_jobfam_session_dir=/tmp/ompi.comp-sc-0220.26954/pid.8212
OMPI_NUM_APP_CTX=1
OMPI_FIRST_RANKS=0
OMPI_APP_CTX_NUM_PROCS=1
OMPI_MCA_initial_wdir=/storage/work/k/****
OMPI_MCA_orte_launch=1
OMPI_MCA_ess_base_jobid=4134338561
OMPI_MCA_ess_base_vpid=0
OMPI_COMM_WORLD_RANK=0
OMPI_COMM_WORLD_LOCAL_RANK=0
OMPI_COMM_WORLD_NODE_RANK=0
OMPI_MCA_orte_ess_node_rank=0
OMPI_FILE_LOCATION=/tmp/ompi.comp-sc-0220.26954/pid.8212/0/0
But if I run the same test4.R again, I get the following error:
/gpfs/group/RISE/sw7/R-4.0.3-intel-19.1.2-mkl-2020.3/R-4.0.3/../install/lib64/R/bin/exec/R: error while loading shared libraries: libiomp5.so: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
(the same shared-library error is printed for the remaining ranks)
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[63743,1],0]
Exit code: 127
--------------------------------------------------------------------------
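The libiomp5.so failure fits the fact that this r/4.0.3 build was compiled against intel-19.1.2/mkl-2020.3 (its install path says as much): unloading the intel module removed the OpenMP runtime that R itself needs. A possible workaround, sketched under the assumption that libiomp5.so lives somewhere under the Intel install seen earlier (/opt/aci/intel/...):

```shell
# Keep Intel's mpirun off the PATH (the intel module stays unloaded), but
# put its OpenMP runtime back on the library path. The search root below
# is a guess based on the Intel install directory shown above.
IOMP_DIR=$(dirname "$(find /opt/aci/intel -name 'libiomp5.so' 2>/dev/null | head -n 1)")
export LD_LIBRARY_PATH="$IOMP_DIR:$LD_LIBRARY_PATH"
mpirun -np 4 Rscript "codes/test/test4.R"
```

Switching to a gcc-built R, as in EDIT 2 below, sidesteps the problem more cleanly.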
============================================================
(EDIT 2)
I changed my module load command again, to module load openmpi/4.1.4-gcc.9.3.1 r/4.0.5-gcc-9.3.1. With this newer, gcc-built R I ran my test4.R script again with mpirun -np 4 Rscript "codes/test/test4.R". It now returns a new error message:
[1] "/storage/home/k/kxk5678/.R"
[2] "/gpfs/group/RISE/sw7/R-4.0.5-gcc-9.3.1/install/lib64/R/library"
[1] "/storage/home/k/kxk5678/.R"
[2] "/gpfs/group/RISE/sw7/R-4.0.5-gcc-9.3.1/install/lib64/R/library"
[1] "/storage/home/k/kxk5678/.R"
[2] "/gpfs/group/RISE/sw7/R-4.0.5-gcc-9.3.1/install/lib64/R/library"
[1] "/storage/home/k/kxk5678/.R"
[2] "/gpfs/group/RISE/sw7/R-4.0.5-gcc-9.3.1/install/lib64/R/library"
[1] 4
[1] 4
[1] 4
[1] 4
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------
Error in mpi.comm.spawn(slave = system.file("Rslaves.sh", package = "Rmpi"), :
MPI_ERR_SPAWN: could not spawn processes
Calls: mpi.spawn.Rslaves -> mpi.comm.spawn
Execution halted
(the same spawn error is printed for each of the four ranks)
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[62996,1],1]
Exit code: 1
--------------------------------------------------------------------------
Install the pbdMPI package in an R session on the login node, then run the following translation of the Rmpi test code to pbdMPI:
library(pbdMPI)
ns <- comm.size()
# Tell all R sessions to return a message identifying themselves
id <- comm.rank()
ns <- comm.size()
host <- system("hostname", intern = TRUE)
comm.cat("I am", id, "on", host, "of", ns, "\n", all.rank = TRUE)
# Test computations
x <- 5
x <- rnorm(x)
comm.print(length(x))
comm.print(x, all.rank = TRUE)
finalize()
Run it the same way you ran the Rmpi version: mpirun -np 4 Rscript your_new_script_file.
Spawning (as in the Rmpi example) was appropriate on clusters of workstations, but on an HPC cluster the prevalent way to program with MPI is SPMD: single program, multiple data. SPMD means your code is a generalization of a serial code that runs as several copies cooperating with each other.
In the above example, cooperation happens only in printing (the comm... functions). There is no manager/master - just several R sessions running the same code (usually computing something different based on comm.rank()) and cooperating/communicating via MPI. This is the prevalent way of large-scale parallel computing on HPC clusters.
We are having trouble with Open MPI 4.0.5 on our cluster: it works as long as only one node is requested, but as soon as more than one is requested (e.g. mpirun -np 24 ./hello_world with --ntasks-per-node=12), it crashes with the following error message:
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 2
slots that were requested by the application:
./hello_world
Either request fewer slots for your application, or make more slots
available for use.
A "slot" is the Open MPI term for an allocatable unit where we can
launch a process. The number of slots available are defined by the
environment in which Open MPI processes are run:
1. Hostfile, via "slots=N" clauses (N defaults to number of
processor cores if not provided)
2. The --host command line parameter, via a ":N" suffix on the
hostname (N defaults to 1 if not provided)
3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
4. If none of a hostfile, the --host command line parameter, or an
RM is present, Open MPI defaults to the number of processor cores
In all the above cases, if you want Open MPI to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.
Alternatively, you can use the --oversubscribe option to ignore the
number of available slots when deciding the number of processes to
launch.
--------------------------------------------------------------------------
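When Open MPI's Slurm integration is working, the allocation itself provides the slots and no hostfile is needed; when it is not, an explicit hostfile with slots clauses is a common workaround. A sketch (node names and counts follow the examples in this question, but are otherwise assumptions):

```shell
# Preferred: let Open MPI read the Slurm allocation directly.
salloc -N 2 --ntasks-per-node=12 mpirun -np 24 ./hello_world

# Workaround: spell the slots out in a hostfile.
printf 'node36 slots=12\nnode37 slots=12\n' > myhosts
mpirun -np 24 --hostfile myhosts ./hello_world
```

If the first form fails while the hostfile works, that points at Open MPI not having been built with (or not detecting) Slurm support.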
I have tried using --oversubscribe, but then the job still only uses one node, even though smaller jobs run that way. I have also tried requesting nodes explicitly (e.g. -host node36,node37), but this results in the following error message:
[node37:16739] *** Process received signal ***
[node37:16739] Signal: Segmentation fault (11)
[node37:16739] Signal code: Address not mapped (1)
[node37:16739] Failing at address: (nil)
[node37:16739] [ 0] /lib64/libpthread.so.0(+0xf5f0)[0x2ac57d70e5f0]
[node37:16739] [ 1] /lib64/libc.so.6(+0x13ed5a)[0x2ac57da59d5a]
[node37:16739] [ 2] /usr/lib64/openmpi/lib/libopen-rte.so.12(orte_daemon+0x10d7)[0x2ac57c6c4827]
[node37:16739] [ 3] orted[0x4007a7]
[node37:16739] [ 4] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2ac57d93d505]
[node37:16739] [ 5] orted[0x400810]
[node37:16739] *** End of error message ***
The cluster has 59 nodes. Slurm 19.05.0 is used as the scheduler, and gcc 9.1.0 to compile.
I don't have much experience with MPI - any help would be much appreciated! Maybe someone is familiar with this error and could point me toward what the problem might be.
Thanks for your help,
Johanna
I have an old Xen machine that got hard rebooted. All of the guest domains recovered, except one (the important one, of course). This is what I get when I try to boot that domain:
[ 0.486626] xvda2: detected capacity change from 0 to 1073741824
Begin: Loading essential drivers ... done.
Begin: Running /scripts/init-premount ... done.
Begin: Mounting root file system ... Begin: Running /scripts/local-top ... done.
Begin: Running /scripts/local-premount ... done.
[ 0.782231] EXT4-fs (xvda1): mounted filesystem with ordered data mode. Opts: (null)
Begin: Running /scripts/local-bottom ... done.
done.
Begin: Running /scripts/init-bottom ... done.
[ 1.045040] systemd[1]: Failed to determine whether /sys is a mount point: Bad file descriptor
[ 1.045072] systemd[1]: Failed to determine whether /proc is a mount point: Bad file descriptor
[ 1.045091] systemd[1]: Failed to determine whether /dev is a mount point: Bad file descriptor
[!!!!!!] Failed to mount early API filesystems, freezing.
[ 1.046082] systemd[1]: Freezing execution.
I understand vaguely what it's saying: there's a problem with /sys, /proc, or /dev and so the VM can't boot.
However, in dom0 I can mount disk.img at /mnt and browse all of its contents, including those special directories.
I've tried using losetup to set up a loop device for disk.img, and fsck -f on that device reports no errors.
If this were a physical machine, I'd probably try booting into single-user mode and examining the "physical" disk devices with fsck, but I'm not sure how to do that with Xen involved.
What am I missing here? What should I try next?
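Two things that may help, sketched with assumed names (disk.img and the domain config path are placeholders, and older toolstacks use xm instead of xl): fsck the guest's filesystems per partition from dom0, and boot the guest with an init override to get a rescue shell before systemd runs.

```shell
# From dom0: attach the image with partition scanning so each partition
# gets its own loop device, then fsck the root filesystem the guest mounts.
LOOPDEV=$(losetup -P -f --show disk.img)
fsck -f "${LOOPDEV}p1"
losetup -d "$LOOPDEV"

# For a rescue shell inside the guest, add an init override to the kernel
# command line in the domain config, e.g.:
#   extra = "init=/bin/sh"
# then boot with the console attached:
xl create -c /etc/xen/guest.cfg
```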
I have a new Debian (9.3) install with new salt-master (2017.7.4) and salt-minion installed. In /etc/salt/minion.d I have a conf file containing:
master: 127.0.0.1
grains:
roles:
- 'introducer'
In /srv/salt/top.sls I have:
base:
# https://docs.saltstack.com/en/latest/ref/states/top.html
'G@roles:introducer':
- 'introducer'
In /srv/pillar/data.sls I have:
introducer:
location: 'tcp:x.x.x.x:y'
port: 'tcp:y'
When I run salt '*' state.apply, I encounter this failure:
668629:
Data failed to compile:
----------
Rendering SLS 'base:introducer' failed: Jinja variable 'salt.pillar object' has no attribute 'introducer'
ERROR: Minions returned with non-zero exit code
Why isn't the pillar data available?
Pillar data requires a top definition as well. The configuration described in the question has no Pillar top.sls so no Pillar data is selected for any of the minions.
To correct this, add a top.sls to the Pillar directory which selects the desired minions and makes the data available to them. For example, this /srv/pillar/top.sls:
base:
'*':
- 'data'
This makes the contents of /srv/pillar/data.sls available to all minions (selected by *) in the base environment.
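After adding the Pillar top file, the minion's pillar cache may need a refresh before the state renders. A quick way to verify the data is now visible (standard salt execution modules, targeting as in the question):

```shell
# Refresh pillar on all minions, then confirm the 'introducer' key resolves.
salt '*' saltutil.refresh_pillar
salt '*' pillar.get introducer
# A full re-render should now succeed:
salt '*' state.apply
```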
I have installed Riak (via apt-get) on an EC2 instance running Ubuntu Lucid, amd64, with libssl.
When running riak start I get:
Attempting to restart script through sudo -H -u riak
Riak failed to start within 15 seconds,
see the output of 'riak console' for more information.
If you want to wait longer, set the environment variable
WAIT_FOR_ERLANG to the number of seconds to wait.
Running riak console:
Exec: /usr/lib/riak/erts-5.9.1/bin/erlexec -boot /usr/lib/riak/releases/1.3.1/riak
-embedded -config /etc/riak/app.config
-pa /usr/lib/riak/lib/basho-patches
-args_file /etc/riak/vm.args -- console
Root: /usr/lib/riak
Erlang R15B01 (erts-5.9.1) [source] [64-bit] [smp:2:2] [async-threads:64] [kernel-poll:true]
/usr/lib/riak/lib/os_mon-2.2.9/priv/bin/memsup: Erlang has closed.
Erlang has closed
{"Kernel pid terminated",application_controller,"{application_start_failure,riak_core, {shutdown,{riak_core_app,start,[normal,[]]}}}"}
Crash dump was written to: /var/log/riak/erl_crash.dump
Kernel pid terminated (application_controller) ({application_start_failure,riak_core, {shutdown,{riak_core_app,start,[normal,[]]}}})
The error logs:
2013-04-24 11:36:20.897 [error] <0.146.0> CRASH REPORT Process riak_core_handoff_listener with 1 neighbours exited with reason: bad return value: {error,eaddrinuse} in gen_server:init_it/6 line 332
2013-04-24 11:36:20.899 [error] <0.145.0> Supervisor riak_core_handoff_listener_sup had child riak_core_handoff_listener started with riak_core_handoff_listener:start_link() at undefined exit with reason bad return value: {error,eaddrinuse} in context start_error
2013-04-24 11:36:20.902 [error] <0.142.0> Supervisor riak_core_handoff_sup had child riak_core_handoff_listener_sup started with riak_core_handoff_listener_sup:start_link() at undefined exit with reason shutdown in context start_error
2013-04-24 11:36:20.903 [error] <0.130.0> Supervisor riak_core_sup had child riak_core_handoff_sup started with riak_core_handoff_sup:start_link() at undefined exit with reason shutdown in context start_error
I'm new to Riak and was basically working through the "Fast Track" docs.
None of the default core IP settings in the configs have been changed. They are still set to {http, [ {"127.0.0.1", 8098 } ]} and {handoff_port, 8099 }.
Any help would be greatly appreciated.
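The root cause is visible in the error logs: {error,eaddrinuse} on riak_core_handoff_listener means something is already bound to the handoff port (8099), typically a stale beam process left over from a previous start attempt. A diagnostic sketch (tool availability varies by system):

```shell
# See what holds the Riak ports (8098 HTTP, 8099 handoff):
sudo netstat -tlnp | grep -E ':(8098|8099)'
# Look for a leftover Erlang VM from an earlier attempt
# (the [b] trick keeps grep from matching itself):
ps aux | grep '[b]eam'
# Stop it cleanly if possible, then try again:
riak stop
riak start
```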
I know this is old, but there is some solid documentation about the errors in the crash.dump file on the Riak site.