Whenever I execute this command:
mpiexec -hosts 2 machin1(ip address) 1 machin2(ip address) 1 mpiprogram.exe
I get this error:
aborting:Access denied by node
a common cause:this node is a resource managed by the computer cluster scheduler and mpiexec was attempting to use it without the a scheduled job
I don't know what the problem is. What is a job scheduler, and what should I do to resolve this?
I submitted a job via Slurm. The job ran for 12 hours and was working as expected. Then I got the error "Data unpack would read past end of buffer in file util/show_help.c at line 501". I often get errors like "ORTE has lost communication with a remote daemon", but usually at the beginning of the job. That is annoying, but it costs far less time than an error after 12 hours. Is there a quick fix for this? The Open MPI version is 4.0.1. The output was:
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.
Local host: barbun40
Local adapter: mlx5_0
Local port: 1
WARNING: There was an error initializing an OpenFabrics device.
Local host: barbun40
Local device: mlx5_0
[barbun21.yonetim:48390] [[15284,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in
file util/show_help.c at line 501
[barbun21.yonetim:48390] 127 more processes have sent help message help-mpi-btl-openib.txt / ib port
not selected
[barbun21.yonetim:48390] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error
[barbun21.yonetim:48390] 126 more processes have sent help message help-mpi-btl-openib.txt / error in
device init
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
An MPI communication peer process has unexpectedly disconnected. This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).
Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate. For
example, there may be a core file that you can examine. More
generally: such peer hangups are frequently caused by application bugs
or other external events.
Local host: barbun64
Local PID: 252415
Peer host: barbun39
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[15284,1],35]
Exit code: 9
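The log above names two MCA parameters (btl_openib_allow_ib and orte_base_help_aggregate). As a minimal sketch of how such parameters are commonly set, Open MPI reads environment variables of the form OMPI_MCA_<name>, and mpirun also accepts --mca <name> <value>. The values below are taken from the log text; ./my_app is a hypothetical program name. This only addresses the warnings and the aggregated help messages, and it is not known whether it has any bearing on the exit code 9.

```shell
# Environment-variable form: OMPI_MCA_<name> sets the MCA parameter <name>.
export OMPI_MCA_btl_openib_allow_ib=true      # value suggested by the warning text
export OMPI_MCA_orte_base_help_aggregate=0    # show every help/error message

# Equivalent command-line form (./my_app is a hypothetical program name):
#   mpirun --mca btl_openib_allow_ib true --mca orte_base_help_aggregate 0 ./my_app

# Show what is currently set for this shell and its child processes.
env | grep '^OMPI_MCA_'
```

In a Slurm batch script, the export lines would go before the mpirun call so that the launched processes inherit them.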
I am trying to run a simple MPI job across multiple hosts of a cluster.
[capc@gpu6 mpi_tests]$ /opt/openmpi4.0.3/build/bin/mpirun --host gpu7,gpu6 ./a.out
WARNING: There is at least non-excluded one OpenFabrics device found,
but there are no active ports detected (or Open MPI was unable to use
them). This is most certainly not what you wanted. Check your
cables, subnet manager configuration, etc. The openib BTL will be
ignored for this job.
Local host: gpu7
We have 2 processes.
WARNING: Open MPI accepted a TCP connection from what appears to be a
another Open MPI process but cannot find a corresponding process
entry for that peer.
This attempted connection will be ignored; your MPI job may or may not
continue properly.
Local host: gpu6
PID: 29209
[gpu6:29203] 1 more process has sent help message help-mpi-btl-openib.txt / no active ports found
[gpu6:29203] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
I compiled the MPI program with mpicc, and when I run it with mpirun it hangs.
Can anyone guide me on this?
I have a server.c program that is initialising a message queue with the following permissions:
#define SERVER "/serverqueue"
struct mq_attr attr;
attr.mq_flags = 0;
attr.mq_maxmsg = MAX_MSGS;
attr.mq_msgsize = MAX_MSG_SIZE;
attr.mq_curmsgs = 0;
server = mq_open(SERVER, O_RDWR | O_CREAT, 666, &attr);
On the first run, mq_open() succeeds and the program exits with no error. On subsequent runs, I get "Permission denied" errors at mq_open(). Why is this happening?
In case it's relevant, I am not explicitly closing or unlinking the message queue descriptors, since the OS does that automatically when the program exits, if I am not wrong.
Message queues persist after the process exits. The second open fails because you specify the mode as 666, which is a decimal literal (equal to octal 01232) rather than the intended octal 0666, and that results in rather strange permissions:
$ ls -l /dev/mqueue/serverqueue
--w--wx--T. 1 fw fw 80 Feb 17 13:13 serverqueue
There are no read permissions, so opening with O_RDWR fails.
Furthermore, since queue names are a shared resource, creating queues with O_CREAT instead of O_CREAT | O_EXCL usually results in a security vulnerability. Another user could have created the same queue with different permissions, and thus gained access to whatever you are trying to do with the queue.
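The permission arithmetic can be checked from a shell. This is a minimal sketch: mode_after_umask is a hypothetical helper name, and the umask of 002 is an assumption inferred from the listing above (mq_open() masks the requested mode against the process umask).

```shell
# mode_after_umask MODE UMASK: print (MODE & ~UMASK) in octal.
# Inside $(( )), a leading 0 means octal, as in C; no leading 0 means decimal.
mode_after_umask() {
    printf '%o\n' "$(( $1 & ~$2 ))"
}

printf '%o\n' 666             # decimal 666 written in octal: 1232
mode_after_umask 666 0002     # mode from the question, assumed umask 002: 1230
mode_after_umask 0666 0002    # octal 0666 under the same assumed umask: 664
```

Octal 1230 is the sticky bit plus write for the owner and write and execute for the group, which matches the --w--wx--T listing. Octal 664 would give the owner and group read and write access.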
I was in riak-shell when ssh lost its connection to the server. After reconnecting, I do the following:
sudo riak-shell
and get:
An instance of riak-shell is already running
So I restarted the Riak node in question, but this did not solve the problem. I do not see anything to kill in the output of ps -aux. According to the docs, only one instance can run at a time. That makes sense, but when I run riak-shell from another node and try to connect to any node, I now get the following:
Error: invalid function call : connection_EXT:connect ["riak#<<<ip_address_elided>>>"]
You can connect to a specific node (whether in your riak_shell.config
or not) by typing 'connect "dev1#";' substituting your
node name for dev1.
You may need to change the Erlang cookie to do this.
See also the 'reconnect' command.
Unhandled message received is {#Ref<>,disconnected}
I have not changed the cookies during this process, and the cookie appears to be the same (at least in /etc/riak/riak_shell.config). (I am running the Riak TS AMI on AWS.)
riak-shell runs in its own Erlang VM, entirely separate from the Riak node.
(You don't need to run riak-shell from the machine your node is on; it uses the normal riak-erlang-client to talk to Riak.)
If you are on Linux, run ps aux | grep riak_shell_app. It will give you the process ID you need to kill that instance:
08:30:45:~ $ ps aux | grep riak_shell_app
vagrant 4671 0.0 0.3 493260 34884 pts/4 Sl+ Aug17 0:03 /home/vagrant/riak_ee/dev/dev1/erts-5.10.3/bin/beam.smp -- -root /home/vagrant/riak_ee/dev/dev1 -progname erl -- -home /home/vagrant -- -boot /home/vagrant/riak_ee/dev/dev1/releases/2.1.1/start_clean -run riak_shell_app boot debug_off /home/vagrant/riak_ee/dev/dev1/bin/../log/riak_shell/riak_shell -noshell -config /home/vagrant/riak_ee/dev/dev1/bin/../etc/riak
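As a small sketch of pulling the process ID out of that output: the PID is the second column of ps aux. The helper name pids_from_ps is hypothetical, and the sample line is a shortened stand-in for the real output above.

```shell
# Print the PID column (field 2) of `ps aux`-style lines read from stdin.
pids_from_ps() {
    awk '{ print $2 }'
}

# Shortened sample line in the same column layout as the output above.
sample='vagrant 4671 0.0 0.3 493260 34884 pts/4 Sl+ Aug17 0:03 beam.smp -run riak_shell_app boot'
printf '%s\n' "$sample" | pids_from_ps    # prints 4671

# Against live processes, the [r] bracket stops grep from matching itself:
#   ps aux | grep '[r]iak_shell_app' | pids_from_ps
# Each PID printed that way could then be passed to kill.
```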
I wrote a good chunk of it, so let me know how you get on.
I tried to install Oozie on my PC, and it looks like it installed successfully:
oozie admin -oozie http://localhost:11000/oozie -status
System mode: NORMAL
But when I run an Oozie job, it shows the following error:
Error: IO_ERROR : java.net.UnknownHostException: master
Can you please suggest what could be the reason?
The reason for this failure can be examined further in oozie.log, which is usually located in the /var/log/oozie folder on the CDH distribution. Alternatively, the log location can be found by running this command:
ps -ef | grep oozie
and looking for "-Doozie.log.dir=..." in its output.
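A minimal sketch of extracting that value instead of reading it by eye. The helper name oozie_log_dir and the sample command line are hypothetical, and the approach assumes the path contains no spaces.

```shell
# Print the value of every -Doozie.log.dir=... argument found on stdin.
oozie_log_dir() {
    tr ' ' '\n' | sed -n 's/^-Doozie\.log\.dir=//p'
}

# Hypothetical command line for illustration; the path is the one mentioned above.
sample='java -Xmx1g -Doozie.log.dir=/var/log/oozie -cp app.jar Main'
printf '%s\n' "$sample" | oozie_log_dir    # prints /var/log/oozie

# Against live processes:
#   ps -ef | grep '[o]ozie' | oozie_log_dir
```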
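The UnknownHostException means the hostname "master" could not be resolved from the machine running the command. The Oozie host also needs to be reachable from the host where the command line is invoked. Try connecting to the Oozie port with telnet to make sure the connection is good. An example session looks like this: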
[cloudera@localhost hadoop-yarn]$ telnet master 11000
Connected to master.
Escape character is '^]'.