SBT forces file system locking, even on distributed file systems

I intended to run an extensive test suite that uses SBT on our university high performance computing cluster (it uses the Lustre file system).
Since I have very basic user privileges, I was only able to try installing manually and installing by extracting the tarball.
Even with -Dsbt.boot.lock=false, I get the following stack trace:
java.io.IOException: Function not implemented
at sun.nio.ch.FileDispatcherImpl.lock0(Native Method)
at sun.nio.ch.FileDispatcherImpl.lock(FileDispatcherImpl.java:89)
at sun.nio.ch.FileChannelImpl.tryLock(FileChannelImpl.java:1024)
at java.nio.channels.FileChannel.tryLock(FileChannel.java:1154)
at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:86)
at xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:78)
at xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:97)
at xsbt.boot.Using$.withResource(Using.scala:10)
at xsbt.boot.Using$.apply(Using.scala:9)
at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:58)
at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:48)
at xsbt.boot.Locks$.apply0(Locks.scala:31)
at xsbt.boot.Locks$.apply(Locks.scala:28)
at xsbt.boot.Launch.locked(Launch.scala:238)
at xsbt.boot.Launch.app(Launch.scala:147)
at xsbt.boot.Launch.app(Launch.scala:145)
at xsbt.boot.Launch$.run(Launch.scala:102)
at xsbt.boot.Launch$$anonfun$apply$1.apply(Launch.scala:35)
at xsbt.boot.Launch$.launch(Launch.scala:117)
at xsbt.boot.Launch$.apply(Launch.scala:18)
at xsbt.boot.Boot$.runImpl(Boot.scala:41)
at xsbt.boot.Boot$.main(Boot.scala:17)
at xsbt.boot.Boot.main(Boot.scala)
Error during sbt execution: java.io.IOException: Function not implemented
The issue is that parallel distributed file systems such as Lustre and NFS do not implement lock0, but SBT seems to rely on it.
Not being able to run my test suite on a high performance cluster is a huge disadvantage, as it takes at least 6 hours to run the test suite on my Intel i5 laptop with a 7200 rpm HDD (which is my only alternative). I do not have access to any file system other than the distributed file system, so putting the boot directory elsewhere is not an option.
I had intended to submit this as an issue on GitHub, but the community guidelines indicate that posting a question on StackOverflow is a better option for this particular kind of issue.
I ended up running the tests overnight on my laptop, but I am not very pleased with this. Unless this is fixed, I will not be able to continue using SBT for my research on actor-based testing.

You need to mount the client with -o flock in order to enable distributed locking.
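For reference, on Lustre that remount looks roughly like the line below; the MGS node, file system name, and mount point are placeholders, and remounting requires root on the client, so it is something to request from the cluster administrators:

mount -t lustre -o flock mgsnode@tcp0:/lustre /mnt/lustre

If cluster-wide flock is not an option, Lustre also has a localflock mount option (client-local, non-coherent locking), which may be enough for sbt's boot lock as long as all your sbt processes run on a single node.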

Related

mpirun error of oneAPI with Slurm (and PBS) in old cluster

Recently I installed Intel oneAPI, including the C compiler, Fortran compiler, and MPI library, and compiled VASP with it.
Before presenting the question, I need to clarify some tricks I used during the installation of VASP:
glibc 2.14: the cluster is an old machine with glibc 2.12, while oneAPI needs 2.14. So I compiled glibc 2.14 and exported the library path: export LD_LIBRARY_PATH="~/mysoft/glibc214/lib:$LD_LIBRARY_PATH"
ld 2.24: the ld version on the cluster is 2.20, while a higher version is needed. So I installed binutils 2.24.
There is one master computer connected to 30 compute nodes in the cluster. The calculation is executed in three ways:
When I run the calculation on the master, it is totally OK.
When I log in to a node manually with the rsh command, the calculation on that node also runs without problems.
But usually I submit the calculation script from the master (with Slurm or PBS), and the calculation then runs on a node. In that case, I get the following error message:
[mpiexec#node3.alineos.net] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec#node3.alineos.net] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec#node3.alineos.net] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1062): error waiting for event
[mpiexec#node3.alineos.net] HYD_print_bstrap_setup_error_message (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:1015): error setting up the bootstrap proxies
[mpiexec#node3.alineos.net] Possible reasons:
[mpiexec#node3.alineos.net] 1. Host is unavailable. Please check that all hosts are available.
[mpiexec#node3.alineos.net] 2. Cannot launch hydra_bstrap_proxy or it crashed on one of the hosts. Make sure hydra_bstrap_proxy is available on all hosts and it has right permissions.
[mpiexec#node3.alineos.net] 3. Firewall refused connection. Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE variable.
[mpiexec#node3.alineos.net] 4. pbs bootstrap cannot launch processes on remote host. You may try using -bootstrap option to select alternative launcher.
I only get this error with oneAPI-compiled binaries, not with those compiled with Intel® Parallel Studio XE. Do you have any idea what causes this error? Your response will be highly appreciated.
Best,
Léon
Could it be a permissions error with the Slurm agent not having the correct permissions or library path?
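One thing to check along those lines: depending on how the scheduler is configured, the batch job may not see the LD_LIBRARY_PATH you exported interactively (PBS, for example, does not pass the submission environment unless asked to, and batch shells do not source your interactive profile). A rough sketch of a Slurm script that re-exports the toolchain paths inside the job; the resource requests, binutils prefix, and VASP binary name are placeholders:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=16
# re-export the toolchain paths, since the batch shell may not have them
export LD_LIBRARY_PATH="$HOME/mysoft/glibc214/lib:$LD_LIBRARY_PATH"
export PATH="$HOME/mysoft/binutils224/bin:$PATH"
mpirun ./vasp_std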

ORTE problem when running MPI on multiple computing nodes

I am trying to run a simple MPI example on a cluster with multiple computing nodes. For now I am just using two test nodes, gpu8 and gpu12.
What I have done includes:
gpu8 and gpu12 have the correct MPI environment (OpenMPI-4.0.1). I can successfully run the MPI example on a single node.
Passwordless login between gpu8 and gpu12 has been set up. They can ssh to the other node with no issues.
There is a hostfile on each node containing
gpu8
gpu12
The executable files are under the same path.
echo $PATH (on both nodes) gives
/home/user_1/share/local/openmpi-4.0.1/bin:xxxxxx
echo $LD_LIBRARY_PATH (on both nodes) gives
/home/t716/shshi/share/local/openmpi-4.0.1/lib:
The ORTE problem:
I am running mpirun -np 2 --hostfile /home/user_2/hosts ./home/user_2/mpi-hello-world/mpi_hello_world. The error output is:
bash: orted: command not found
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------
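The "bash: orted: command not found" line means the shell that mpirun opens on the remote node over ssh does not see the PATH shown above (non-interactive remote shells often skip the profile files where it is set). A hedged workaround, assuming the Open MPI install prefix really is the same on both nodes, is to pass the prefix explicitly, or rebuild Open MPI with --enable-orterun-prefix-by-default as the message suggests:

mpirun --prefix /home/user_1/share/local/openmpi-4.0.1 -np 2 --hostfile /home/user_2/hosts /home/user_2/mpi-hello-world/mpi_hello_world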

PETSc error when running OpenMDAO v1.7.3 tutorials and benchmarks

I have tried running the OpenMDAO paraboloid tutorial as well as benchmarks, and I consistently receive the same error, which reads as follows:
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[0]PETSC ERROR: to get more information on the crash
---------------------------------------------------------------------
MPI_abort was invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode 59.
NOTE: invoking MPI_ABORT causes MPI to kill all MPI processes.
you may or may not see output from other processes, depending on exactly when Open MPI kills them.
I don't understand why this error is occurring and what I can do to be able to run OpenMDAO without getting this error. Can you please help me with this?
Something has not gone well with your PETSc install. It's hard to debug that from afar, though. It could be in your MPI install, your PETSc install, or your petsc4py install. I suggest not installing PETSc or petsc4py through pip, though; I've had mixed success with that. Both can be installed from source without tremendous difficulty.
However, to run the tutorials you don't need to have PETSc installed. You could remove those packages, and the tutorials will run correctly in serial for you.
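If those packages were installed with pip, removing them would look roughly like this (assuming petsc4py and mpi4py are the packages OpenMDAO is picking up for its parallel features):

pip uninstall petsc4py mpi4py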

Git-svn out of memory

I'm trying to clone a reasonably big SVN repository with git-svn, and at a certain point I get an error message:
Failure loading plugin: APR: Can't create a character converter from 'UTF-8' to native encoding: Cannot allocate memory at /usr/libexec/git-core/git-svn line 5061
And sometimes a
Cannot allocate memory: zlib (compress2): out of memory: Compression of svndiff data failed at /usr/libexec/git-core/git-svn line 5061
error message. I still have ~3GB RAM free. What should I do so git-svn can utilize it?
(I'm doing this on RedHat Enterprise Linux 6.5 if that makes any difference)
From:
This error message is about the memory git is trying to allocate --
it's more than what is free. This is most likely caused by a large
file having been checked into SVN. Unfortunately, there's no easy way
to fix it (apart from buying more memory) -- you would have to remove
the large file and the commit adding it from SVN.
However, try the following (a rough sketch of both is shown after the list):
Increase swap memory
Increase ulimit
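The sketch below assumes RHEL 6 defaults; the swap file size is arbitrary, and the system-wide steps need root:

# add a temporary 4 GB swap file
sudo dd if=/dev/zero of=/swapfile bs=1M count=4096
sudo mkswap /swapfile
sudo swapon /swapfile
# lift the per-process memory limits in the shell that will run git svn
ulimit -v unlimited
ulimit -m unlimited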

SBT runs out of memory

I am using SBT 0.12.3 to test some code and often I get this error message while testing interactively with the ~test command.
8. Waiting for source changes... (press enter to interrupt)
[info] Compiling 1 Scala source to C:\Users\t\scala-projects\scala test\target\scala-2.10\classes...
sbt appears to be exiting abnormally.
The log file for this session is at C:\Users\t\AppData\Local\Temp\sbt5663259053150896045.log
java.lang.OutOfMemoryError: PermGen space
at java.util.concurrent.FutureTask$Sync.innerGet(Unknown Source)
at java.util.concurrent.FutureTask.get(Unknown Source)
at sbt.ConcurrentRestrictions$$anon$4.take(ConcurrentRestrictions.scala:196)
at sbt.Execute.next$1(Execute.scala:85)
at sbt.Execute.processAll(Execute.scala:88)
at sbt.Execute.runKeep(Execute.scala:68)
at sbt.EvaluateTask$.run$1(EvaluateTask.scala:162)
at sbt.EvaluateTask$.runTask(EvaluateTask.scala:177)
at sbt.Aggregation$$anonfun$4.apply(Aggregation.scala:46)
at sbt.Aggregation$$anonfun$4.apply(Aggregation.scala:44)
at sbt.EvaluateTask$.withStreams(EvaluateTask.scala:137)
at sbt.Aggregation$.runTasksWithResult(Aggregation.scala:44)
at sbt.Aggregation$.runTasks(Aggregation.scala:59)
at sbt.Aggregation$$anonfun$applyTasks$1.apply(Aggregation.scala:31)
at sbt.Aggregation$$anonfun$applyTasks$1.apply(Aggregation.scala:30)
at sbt.Command$$anonfun$applyEffect$2$$anonfun$apply$3.apply(Command.scala:62)
at sbt.Command$$anonfun$applyEffect$2$$anonfun$apply$3.apply(Command.scala:62)
at sbt.Command$.process(Command.scala:90)
at sbt.MainLoop$$anonfun$next$1$$anonfun$apply$1.apply(MainLoop.scala:71)
at sbt.MainLoop$$anonfun$next$1$$anonfun$apply$1.apply(MainLoop.scala:71)
at sbt.State$$anon$2.process(State.scala:170)
at sbt.MainLoop$$anonfun$next$1.apply(MainLoop.scala:71)
at sbt.MainLoop$$anonfun$next$1.apply(MainLoop.scala:71)
at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:18)
at sbt.MainLoop$.next(MainLoop.scala:71)
at sbt.MainLoop$.run(MainLoop.scala:64)
at sbt.MainLoop$$anonfun$runWithNewLog$1.apply(MainLoop.scala:53)
at sbt.MainLoop$$anonfun$runWithNewLog$1.apply(MainLoop.scala:50)
at sbt.Using.apply(Using.scala:25)
at sbt.MainLoop$.runWithNewLog(MainLoop.scala:50)
at sbt.MainLoop$.runAndClearLast(MainLoop.scala:33)
at sbt.MainLoop$.runLoggedLoop(MainLoop.scala:17)
Error during sbt execution: java.lang.OutOfMemoryError: PermGen space
The error is clear: I could increase the heap size and it might stop throwing that error, but the thing is that sbt shuts down after some number (I don't know how many) of test iterations with only minimal changes to the code. Would a simple increase in the heap solve the problem, or do I have to do additional work to avoid running out of memory?
Thanks in advance.
If you haven't, try giving more PermGen space in your sbt.bat. I don't run sbt on Windows, but I give java -Xmx1512M -XX:MaxPermSize=512M.
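For sbt 0.12 those flags go on the java command that starts sbt. A minimal sketch, assuming you launch sbt through the launcher jar directly (the jar path is a placeholder; if you use the stock sbt.bat, add the same flags to its java line):

java -Xmx1512M -XX:MaxPermSize=512M -jar sbt-launch.jar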
Another thing to try may be to fork during testing: https://www.scala-sbt.org/1.x/docs/Testing.html#Forking+tests
Test / fork := true
specifies that all tests will be executed in a single external JVM.
