mpirun error of oneAPI with Slurm (and PBS) in old cluster - mpi

I recently installed Intel oneAPI, including the C compiler, the Fortran compiler, and the MPI library, and compiled VASP with it.
Before getting to the question, I should describe two workarounds I needed during the VASP installation:
GLIBC 2.14: the cluster is an old machine with glibc 2.12, whereas oneAPI needs 2.14. So I compiled glibc 2.14 and exported the library path: export LD_LIBRARY_PATH="~/mysoft/glibc214/lib:$LD_LIBRARY_PATH"
ld 2.24: the cluster's ld version is 2.20, but a newer version is needed, so I installed binutils 2.24.
The cluster has one master machine connected to 30 compute nodes. I run the calculation in three ways:
When I run the calculation on the master, it works fine.
When I log in to a node manually with the rsh command, the calculation on that node also runs without problems.
But usually I submit the job script from the master (with Slurm or PBS) so that the calculation runs on a node. In that case, I get the following error message:
[mpiexec@node3.alineos.net] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec@node3.alineos.net] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec@node3.alineos.net] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1062): error waiting for event
[mpiexec@node3.alineos.net] HYD_print_bstrap_setup_error_message (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:1015): error setting up the bootstrap proxies
[mpiexec@node3.alineos.net] Possible reasons:
[mpiexec@node3.alineos.net] 1. Host is unavailable. Please check that all hosts are available.
[mpiexec@node3.alineos.net] 2. Cannot launch hydra_bstrap_proxy or it crashed on one of the hosts. Make sure hydra_bstrap_proxy is available on all hosts and it has right permissions.
[mpiexec@node3.alineos.net] 3. Firewall refused connection. Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE variable.
[mpiexec@node3.alineos.net] 4. pbs bootstrap cannot launch processes on remote host. You may try using -bootstrap option to select alternative launcher.
I only get this error with code compiled with oneAPI, not with code compiled with Intel® Parallel Studio XE. Do you have any idea what causes it? Your response will be highly appreciated.
Best,
Léon

Could it be a permissions error with the Slurm agent not having the correct permissions or library path?
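One way to test this hypothesis is to compare the environment a batch job sees with the one in an interactive login on the same node. The following is only a minimal sketch, assuming bash is available on the nodes; the function name report_env is hypothetical, and hydra_bstrap_proxy is the helper named in the error message.

```shell
#!/usr/bin/env bash
# Print the parts of the environment that the Hydra launcher depends on.
# Run this once in an interactive shell on a node and once inside a
# Slurm/PBS job, then compare the two outputs.
report_env() {
    echo "host=$(uname -n)"
    echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH:-<unset>}"
    # command -v prints the resolved path, or nothing if the name is not on PATH
    echo "mpirun=$(command -v mpirun || echo '<not found>')"
    echo "hydra_bstrap_proxy=$(command -v hydra_bstrap_proxy || echo '<not found>')"
}

report_env
```

If the output from inside the job lacks the glibc 2.14 directory in LD_LIBRARY_PATH, or shows <not found> for either program, the batch system is probably not passing the login environment on to the job. In that case the proxy would fail to start, which matches reason 2 in the error message.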

Related

HCI_UART on NRF52840, attaching the device on a Yocto based Linux SBC errors out saying "Can't init device hci0: Cannot assign requested address (99)"

I am trying to interface a BLE module based on Nordic's nRF52840 with a Yocto-based SBC, to which all the BlueZ-related packages have been added.
I have flashed Zephyr's hci_uart sample onto the module. The module runs perfectly on my Linux PC (BlueZ version 5.48), whereas on the SBC (BlueZ version 5.54) it fails to initialize. Here is the error I get when I run:
root@rb-imx6:~# hciconfig hci0 up
root@rb-imx6:~# Can't init device hci0: Cannot assign requested address (99)
Can anyone please help me out with this?
Thanks in advance.
The error of assigning an address is caused by missing Linux kernel configuration options:
CONFIG_CRYPTO_USER
CONFIG_CRYPTO_USER_API
CONFIG_CRYPTO_USER_API_AEAD
CONFIG_CRYPTO_USER_API_HASH
CONFIG_CRYPTO_AES
CONFIG_CRYPTO_CCM
CONFIG_CRYPTO_AEAD
CONFIG_CRYPTO_CMAC
This is likely to happen with a self-built Buildroot or Yocto embedded Linux system. If you run into this error, enable the options above and recompile the kernel.
See the BlueZ requirements here: https://git.kernel.org/pub/scm/bluetooth/bluez.git/tree/README#n64
To see detailed debug output from BlueZ, run it with the -d option:
bluetoothd -d
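Before rebuilding, it can help to confirm which of the listed options are actually missing. The following is a minimal sketch, assuming bash and a kernel built with CONFIG_IKCONFIG_PROC so that /proc/config.gz exists; otherwise the function can be pointed at the .config file in the kernel build tree. The function name missing_opts is hypothetical.

```shell
#!/usr/bin/env bash
# Print each option that is not set to y or m in the given kernel config.
# Usage: missing_opts CONFIG_FILE OPTION...
missing_opts() {
    local cfg=$1; shift
    local opt
    for opt in "$@"; do
        # An enabled option appears as OPTION=y (built in) or OPTION=m (module)
        grep -Eq "^${opt}=(y|m)$" "$cfg" || echo "$opt"
    done
}

# Options listed in the answer above
required=(
    CONFIG_CRYPTO_USER CONFIG_CRYPTO_USER_API
    CONFIG_CRYPTO_USER_API_AEAD CONFIG_CRYPTO_USER_API_HASH
    CONFIG_CRYPTO_AES CONFIG_CRYPTO_CCM
    CONFIG_CRYPTO_AEAD CONFIG_CRYPTO_CMAC
)

# Only check the running kernel if it exposes its configuration
if [ -r /proc/config.gz ]; then
    running_cfg=$(mktemp)
    zcat /proc/config.gz > "$running_cfg"
    missing_opts "$running_cfg" "${required[@]}"
    rm -f "$running_cfg"
fi
```

Any option name the script prints is either unset or absent from the configuration and would need to be enabled before rebuilding.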

ORTE problem when running MPI on multiple computing nodes

I am trying to run a simple MPI example on a cluster with multiple computing nodes. For now I am using just two test nodes, gpu8 and gpu12.
What I have done so far:
Both gpu8 and gpu12 have the correct MPI environment (OpenMPI-4.0.1). I can successfully run the MPI example on a single node.
Passwordless login between gpu8 and gpu12 has been set up. Each node can ssh to the other with no issues.
There is a hostfile on each node containing:
gpu8
gpu12
The executable files are under the same path.
echo $PATH (on both nodes) gives
/home/user_1/share/local/openmpi-4.0.1/bin:xxxxxx
echo $LD_LIBRARY_PATH (on both nodes) gives
/home/t716/shshi/share/local/openmpi-4.0.1/lib:
The ORTE problem:
I am running mpirun -np 2 --hostfile /home/user_2/hosts ./home/user_2/mpi-hello-world/mpi_hello_world. The error output is:
bash: orted: command not found
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------
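The first line of the output, "bash: orted: command not found", indicates that the shell mpirun started on the remote node could not find orted on its PATH. A non-interactive ssh session does not necessarily read the same startup files as an interactive login, so the PATH shown by echo $PATH in a login shell may not be the one mpirun sees. The following is a minimal sketch for checking this, assuming bash; the helper name path_contains is hypothetical and the directory is the one shown in the question.

```shell
#!/usr/bin/env bash
# Succeed if directory $2 is one of the colon-separated entries of $1.
path_contains() {
    case ":$1:" in
        *":$2:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

ompi_bin=/home/user_1/share/local/openmpi-4.0.1/bin   # path from the question

if path_contains "$PATH" "$ompi_bin"; then
    echo "Open MPI bin directory is on PATH"
else
    echo "Open MPI bin directory is NOT on PATH"
fi
```

To test the non-interactive case, the PATH to check can be captured with ssh gpu12 'echo $PATH' and passed as the first argument. If the directory is missing there, possible remedies are to set PATH in a startup file that non-interactive shells read, to pass the installation directory with mpirun's --prefix option, or to use the --enable-orterun-prefix-by-default configure flag that the message itself mentions.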

Unable to communicate with the runtime for 'R' script in SQL Server 2017

I'm having trouble getting R to work on SQL Server 2017 on one server (I've successfully installed it on about 8 other servers). I've already installed the latest cumulative update.
When I execute a stored procedure that runs a simple hello world R script, I can see that LaunchPad.exe and rterm.exe are both running. After 60 seconds, however, I get the following error:
Msg 39012, Level 16, State 1, Line 0
Unable to communicate with the runtime for 'R' script. Please check the requirements of 'R' runtime.
STDERR message(s) from external script: Fatal error: creation of tmpfile failed -- set TMPDIR suitably?
This is the script that fails:
EXEC sp_execute_external_script
@language = N'R', @script = N'print("hello")';
Any ideas on what I need to do to resolve this error?
The problem was that Named Pipes wasn't enabled for SQL Server. Enabling that, and restarting the services solved my issue.
My assumption is that you applied the CU after installing Machine Learning Services. If so, the CU somehow messes up the folder permissions.
I wrote a blog post about how to fix it here. The blog post is about CU7, but it should apply to any CU.
I cannot guarantee that it works, as I have seen other issues where ML Services stops working; in those cases the fix is to repair the SQL Server installation.

dbm error only when submitting python job in Slurm

I am running Python code on a remote machine. When I run it on the head node of the cluster, it executes with no problem.
But when I submit it through the Slurm workload manager:
sbatch --wrap="python mycode.py" -N 1 --cpus-per-task=8 -o mycode.o
Then the code fails with the following error (only showing the end of the error):
.
.
line 91, in open
"available".format(result))
dbm.error: db type is dbm.gnu, but the module is not available
I'm confused about how the code can run fine without Slurm but fail when I submit it through Slurm.
The compute (remote) nodes probably don't have the same software installed as the head node, or you may need to do some configuration steps before running. Check with the administrator of the cluster.
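To confirm that the head node and the compute nodes differ, the same import can be attempted in both places. The following is a minimal sketch, assuming bash and that python is the interpreter name used in the question; the function names has_module and report_module are hypothetical.

```shell
#!/usr/bin/env bash
# Succeed if interpreter $1 can import Python module $2.
has_module() {
    "$1" -c "import $2" >/dev/null 2>&1
}

# Print a one-line availability report for module $2 under interpreter $1.
report_module() {
    if has_module "$1" "$2"; then
        echo "$2: available"
    else
        echo "$2: NOT available"
    fi
}

# Show which interpreter is picked up, then test the backend named in the traceback
echo "python=$(command -v python || echo '<not found>')"
report_module python dbm.gnu
```

Saved under a hypothetical name such as check_dbm.sh, the script can be run directly on the head node and then submitted with sbatch --wrap="bash check_dbm.sh" -N 1 -o check_dbm.o. If the two outputs differ, either the interpreter or its gdbm support differs between the nodes.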

Hostapd start error

I set up hostapd following this manual:
http://wireless.kernel.org/en/users/Documentation/hostapd
with this config file
interface=wlan2
bridge=br0
driver=nl80211
ssid=SupaAP
country_code=RU
hw_mode=g
channel=5
preamble=1
macaddr_acl=0
auth_algs=1
logger_syslog=-1
logger_syslog_level=3
logger_stdout=-1
logger_stdout_level=2
ignore_broadcast_ssid=0
ieee80211n=1
ht_capab=[SHORT-GI-20][RX-STBC1]
wmm_enabled=1
and I get this error:
root@Cubian:/home/cubie/wif/hostapd-2.1/hostapd# ./hostapd /etc/hostapd/hostapd.conf
Configuration file: /etc/hostapd/hostapd.conf
Line 16: unknown configuration item 'ieee80211n'
Line 17: unknown configuration item 'ht_capab'
2 errors found in configuration file '/etc/hostapd/hostapd.conf'
Failed to set up interface with /etc/hostapd/hostapd.conf
Failed to initialize interface
With the old version I did not get this error.
It appears that the problem is that hostapd 2.1 is now treating an error it had differently.
This is reproducible on Ubuntu desktop versions, as the resource (WLAN) is busy.
If one turns off the programs that are accessing the resource, hostapd has a chance to grab it and work.
In Ubuntu desktop 14.04 beta, a solution is to turn off the programs that are using the wlan in question.
This worked for me:
sudo nmcli nm wifi off
sudo rfkill unblock wlan
Then hostapd can start normally from the command line. Of course, if you want hostapd to start on boot, you must ensure that the network manager is not grabbing the resource ahead of time.
SOLVED
In the build configuration I uncommented the line
CONFIG_IEEE80211N=y
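The "unknown configuration item" messages suggest the binary was built without 802.11n support, so it does not recognize ieee80211n or ht_capab. The fix described above can be sketched as follows, assuming bash, that the build configuration is the .config file in the hostapd source directory, and that the option is present there as a commented-out line. The function name enable_80211n is hypothetical.

```shell
#!/usr/bin/env bash
# Uncomment CONFIG_IEEE80211N=y in a hostapd build config file.
# Keeps a backup copy as FILE.bak; succeeds only if the option ends up enabled.
enable_80211n() {
    local cfg=$1
    # Allow optional whitespace between '#' and the option name
    sed -i.bak -E 's/^#[[:space:]]*(CONFIG_IEEE80211N=y)$/\1/' "$cfg"
    grep -q '^CONFIG_IEEE80211N=y$' "$cfg"
}
```

From the hostapd source directory, this would be used as enable_80211n .config && make, after which the rebuilt binary should accept the two configuration lines.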
I found this link, which explains the cause of the error and how to solve it:
https://bugs.archlinux.org/task/20269
I also found this other link, http://blog.mirjamali.com/en/IT/Linux/hostapd, which I followed, and it works for me. Check the configuration part.
My OS is Ubuntu 12.4
Start hostapd with:
hostapd -dd /etc/hostapd/hostapd.conf
This gives you more diagnostic information. In which old version did you not get the error? Did you update hostapd, or change your Wi-Fi adapter? Which distribution are you using?
http://hostap.epitest.fi/cgit/hostap/plain/hostapd/hostapd.conf
The ieee80211n settings are documented there. I can't see an error in your configuration.
If you remove both lines, does the AP start correctly?
