dbm error only when submitting python job in Slurm - python-3.6

I am running a python code on a remote machine. When I run it on the head node of the computer, it executes with no problem.
But when I use Slurm workload manager:
sbatch --wrap="python mycode.py" -N 1 --cpus-per-task=8 -o mycode.o
Then the code fails with the following error (only showing the end of the error):
.
.
line 91, in open
"available".format(result))
dbm.error: db type is dbm.gnu, but the module is not available
I'm just confused how a code could run fine without submitting through Slurm, but fail when I do use Slurm.

The compute (remote) nodes probably don't have the same software installed as the head node, or you may need to do some configuration steps before running. Check with the administrator of the cluster.

Related

mpirun error of oneAPI with Slurm (and PBS) in old cluster

Recently I installed Intel OneAPI including c compiler, FORTRAN compiler and mpi library and complied VASP with it.
Before presenting the question, there are some tricks I need to clarify during the installation of VASP:
GLIBC2.14: the cluster is an old machine with a glibc version of 2.12, where OneAPI needs a version of 2.14. So I compile the GLIBC2.14 and export the ld_path: export LD_LIBRARY_PATH="~/mysoft/glibc214/lib:$LD_LIBRARY_PATH"
ld 2.24: The ld version is 2.20 in the cluster, while a higher version is needed. So I installed binutils 2.24.
There is one master computer connected with 30 calculating nodes in the cluster. The calculation is executed with 3 ways:
When I do the calculation in the master, it's totally OK.
When I login the nodes manually with rsh command, the calculation in the logged node is also no problem.
But usually I submit the calculation script from the master (with slurm or pbs), and then do the calculation in the node. In that case, I met following error message:
[mpiexec#node3.alineos.net] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec#node3.alineos.net] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec#node3.alineos.net] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1062): error waiting for event
[mpiexec#node3.alineos.net] HYD_print_bstrap_setup_error_message (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:1015): error setting up the bootstrap proxies
[mpiexec#node3.alineos.net] Possible reasons:
[mpiexec#node3.alineos.net] 1. Host is unavailable. Please check that all hosts are available.
[mpiexec#node3.alineos.net] 2. Cannot launch hydra_bstrap_proxy or it crashed on one of the hosts. Make sure hydra_bstrap_proxy is available on all hosts and it has right permissions.
[mpiexec#node3.alineos.net] 3. Firewall refused connection. Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE variable.
[mpiexec#node3.alineos.net] 4. pbs bootstrap cannot launch processes on remote host. You may try using -bootstrap option to select alternative launcher.
I only met this error with oneAPI compiled codes but Intel® Parallel Studio XE compiled. Do you have any idea of this error? Your response will be highly appreciated.
Best,
Léon
Could it be a permissions error with the Slurm agent not having the correct permissions or library path?

ORTE problem when running MPI on multiple computing nodes

I am trying to run a simple MPI example on a cluster with multiple computing nodes. Now I am just using two test nodes, including gpu8 and gpu12.
What I've done include:
gpu8 and gpu12 have the correct MPI environment (OpenMPI-4.0.1). I can successfully run the MPI example on a single node.
Passwordless login between gpu8 and gpu12 has been setup. They can ssh to another node with no issues.
There is a hostfile on each node containing
gpu8
gpu12
The executable files are under the same path.
echo $PATH (on both nodes) gives
/home/user_1/share/local/openmpi-4.0.1/bin:xxxxxx
echo $LD_LIBRARY_PATH (on both nodes) gives
/home/t716/shshi/share/local/openmpi-4.0.1/lib:
The ORTE problem:
I am running mpirun -np 2 --hostfile /home/user_2/hosts ./home/user_2/mpi-hello-world/mpi_hello_world. The error output is:
bash: orted: command not found
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------

Terminate Spring Cloud Task on OutOfMemory exception

I have a Spring task app deployed on PCF. This app get OutOfMemory exception but not terminate the task.
Many people suggested setting env -XX:OnOutOfMemoryError="kill -9 %p" solve this problem. How can I set it on PCF?
When you run an app on Cloud Foundry, the Java buildpack will run and emit a start command which includes a Java agent that properly handles this for you. It's called the jvmkill agent.
https://github.com/cloudfoundry/jvmkill#overview
This will monitor your app for OOME's and if one happens, print some debug info and kill the app. I believe this is exactly the behavior that you're discussing above, but unlike the way you mentioned this method will print debug info prior to killing the app and IMHO is generally more reliable.
For tasks running on Cloud Foundry, the Java buildpack still installs the kill agent, but it cannot actually insert the kill agent into the start command for your task. This is because CF tasks take the start command entirely from the user.
The general recommendation for starting Java based tasks on CF, is to take the start command generated by the Java buildpack to run your app or another app with the same memory limit and adjust it to start your task instead.
For example, here is the start command generated for Spring Music:
JAVA_OPTS="-agentpath:$PWD/.java-buildpack/open_jdk_jre/bin/jvmkill-1.16.0_RELEASE=printHeapHistogram=1 -Djava.io.tmpdir=$TMPDIR -XX:ActiveProcessorCount=$(nproc) -Djava.ext.dirs= -Djava.security.properties=$PWD/.java-buildpack/java_security/java.security $JAVA_OPTS" && CALCULATED_MEMORY=$($PWD/.java-buildpack/open_jdk_jre/bin/java-buildpack-memory-calculator-3.13.0_RELEASE -totMemory=$MEMORY_LIMIT -loadedClasses=27062 -poolType=metaspace -stackThreads=250 -vmOptions="$JAVA_OPTS") && echo JVM Memory Configuration: $CALCULATED_MEMORY && JAVA_OPTS="$JAVA_OPTS $CALCULATED_MEMORY" && MALLOC_ARENA_MAX=2 SERVER_PORT=$PORT eval exec $PWD/.java-buildpack/open_jdk_jre/bin/java $JAVA_OPTS -cp $PWD/.:$PWD/.java-buildpack/container_security_provider/container_security_provider-1.16.0_RELEASE.jar org.springframework.boot.loader.WarLauncher
Note the -agentpath:$PWD/.java-buildpack/open_jdk_jre/bin/jvmkill-1.16.0_RELEASE=printHeapHistogram=1 part, which starts the jvmkill agent.
Now if I want to adjust this to run java -version. I could do the following:
JAVA_OPTS="-agentpath:$PWD/.java-buildpack/open_jdk_jre/bin/jvmkill-1.16.0_RELEASE=printHeapHistogram=1 -Djava.io.tmpdir=$TMPDIR -XX:ActiveProcessorCount=$(nproc) -Djava.ext.dirs= -Djava.security.properties=$PWD/.java-buildpack/java_security/java.security $JAVA_OPTS" && CALCULATED_MEMORY=$($PWD/.java-buildpack/open_jdk_jre/bin/java-buildpack-memory-calculator-3.13.0_RELEASE -totMemory=$MEMORY_LIMIT -loadedClasses=27062 -poolType=metaspace -stackThreads=250 -vmOptions="$JAVA_OPTS") && echo JVM Memory Configuration: $CALCULATED_MEMORY && JAVA_OPTS="$JAVA_OPTS $CALCULATED_MEMORY" && MALLOC_ARENA_MAX=2 SERVER_PORT=$PORT eval exec $PWD/.java-buildpack/open_jdk_jre/bin/java $JAVA_OPTS -version
Note how I just changed the end, where the actual Java arguments are set.
The commands are quite long, but they do end up working and should do the trick for you.

RHadoop Stream Job Fail with Apache Oozie

I'm really just looking to pick the community's brain for some leads in figuring out what is going on with the issue I'm having.
I'm writing a MR job with RHadoop (rmr2, v3.0.0) and things are great -- IO with HDFS, mapping, reducing. No problems. Life is great.
I'm trying to schedule the job with Apache Oozie, and am running into some issues:
Error in mr(map = map, reduce = reduce, combine = combine, vectorized.reduce, :
hadoop streaming failed with error code 1
I've read the rmr2 debugging guide, but nothing is really getting to the stderr because the job fails before anything even gets scheduled.
In my head, everything points to a difference in environments. However, Oozie is running the job as the same user that I'm able to run everything with via cli, and all of the R environment variables (fetched with Sys.getenv()) are the same, excepting there's some additional class path stuff set with Oozie.
I can post more of the OS or Hadoop versions and config details, but sleuthing some version-specific bugs seems like a bit of a red herring as everything runs fine at the command line.
Anybody have any thoughts what might be some helpful next steps in hunting this beast down?
UPDATE:
I overwrote the system function in the base package to log the user, the host name of the node, and the command being executed before the internal call to system. So before any system call is actually executed, I get something like the following in the stderr:
user#host.name
/usr/bin/hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-102.jar ...
When ran with Oozie, the command printed in the stderr fails with an exit status of 1. When I run the command on user#host.name, it runs successfully. So essentially the EXACT same command with the SAME user on the SAME node fails with Oozie, but runs successfully from cli.

All the tests passed, but bamboo build fails with a statement "No failed tests found, a possible compilation error occurred."

I'm supposed to run some jbehave(automated) tests in bamboo. Once the tests run I'll generate some junit compatible xml files so that bamboo could understand the same. All the jbehave tests are ran as part of a script, because I need to run the jbehave tests in a separate display screen(remember these are automated browser tests). Example script is as follows.
Ex:
export DISPLAY=:0 && xvfb-run --server-args="-screen 0, 1024x768x24"
mvn clean integration-test -DskipTests -P integration-test -Dtest=*
I have one more junit parser task which points to the generated junit compatible xml files. So, once the bamboo build runs and even if all the tests pass, I get red build with the message "No failed tests found, a possible compilation error occurred."
Can somone please help me on this regard.
Your build script might be producing successful test reports, but one (or both, possibly) of your tasks is failing. That means that the failure is probably* occurring after your tests complete. Check your build logs for errors. You might also try logging in to your Bamboo server (as the bamboo user) and running the commands by hand.
I've seen this message in the past when our test task was crashing halfway through the test run, resulting in one malformed report that Bamboo ignored and a bunch of successful reports.
*Check the build log to make sure that your tests are indeed running. If mvn clean doesn't clean out the test report directory, Bamboo might just be parsing stale test reports.
EDIT: (in response to Kishore's links)
It looks like your task to kill Xvfb is what is causing the build to fail.
18-Jul-2012 09:50:18 Starting task 'Kill Xvfb' of type 'com.atlassian.bamboo.plugins.scripttask:task.builder.script'
18-Jul-2012 09:50:18
Beginning to execute external process for build 'Functional Tests - Application Release Test - Default Job'
... running command line:
/bin/sh
/tmp/FUNC-APPTEST-JOB1-91-ScriptBuildTask-4153769009554485085.sh
... in: /opt/bamboo-home/xml-data/build-dir/FUNC-APPTEST-JOB1
... using extra environment variables:
<..snip (no meaningful output)..>
18-Jul-2012 09:50:18 Failing task since return code was 1 while expected 0
18-Jul-2012 09:50:18 Finished task 'Kill Xvfb'
What does your "Kill Xvfb" script do? Are you trying something like pkill -f "[x]vfb"? pkill -f silently returns non-zero if it can't match the expression to any processes.
My solution was to make a 'script' task:
#!/bin/bash
/usr/local/bin/phpcs --report=checkstyle --report-file=build/logs/checkstyle.xml --standard=PSR2 ./lib | exit 0
Which always exits with status 0.
This is because PHP code sniffer return exit status 1 when only 1 coding violation (warning / error) is found which causes the built to fail.
Turns out to be a simple fix.
General bamboo behavior is to scan the entire log and see for any failure codes(1). For this specific configuration i had some 6 scripts out of which one of them was to kill the xvfb(frame buffer). For some reason server is not able to kill xvfb and that task was returning a failure code. Because of this, though all the tests passed, bamboo got one of this error codes from previous tasks and build was failing.
Current fix is to remove the task which kills xvfb and the build went green! \o/.

Resources