I want to run an MPI job on my Kubernetes cluster. The context is that I'm running a modern, nicely containerised app, but part of the workload is a legacy MPI job which isn't going to be rewritten anytime soon, and I'd like to fit it into a Kubernetes "worldview" as much as possible.
One initial question: has anyone had any success running MPI jobs on a Kubernetes cluster? I've seen Christian Kniep's work on getting MPI jobs to run in Docker containers, but he's going down the Docker Swarm path (with peer discovery using Consul running in each container), and I want to stick to Kubernetes (which already knows about all the peers) and inject this information into the container from the outside. I do have full control over all the parts of the application, e.g. I can choose which MPI implementation to use.
I have a couple of ideas about how to proceed:
1. Fat containers containing Slurm and the application code -> populate the slurm.conf with appropriate info about the peers at container startup -> use srun as the container entrypoint to start the jobs.
2. Slimmer containers with only OpenMPI (no Slurm) -> populate a rankfile in the container with info from outside (provided by Kubernetes) -> use mpirun as the container entrypoint.
3. An even slimmer approach, where I basically "fake" the MPI runtime by setting a few environment variables (e.g. the OpenMPI ORTE ones) -> run the mpicc'd binary directly (where it'll find out about its peers through the env vars).
4. Some other option.
5. Give up in despair.
I know trying to mix "established" workflows like MPI with the "new hotness" of kubernetes and containers is a bit of an impedance mismatch, but I'm just looking for pointers/gotchas before I go too far down the wrong path. If nothing exists I'm happy to hack on some stuff and push it back upstream.
I tried MPI jobs on Kubernetes for a few days and solved it by using dnsPolicy: None and dnsConfig (the CustomDNS=true feature gate is needed).
I pushed my manifests (as a Helm chart) here:
https://github.com/everpeace/kube-openmpi
I hope it helps.
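In case it helps to see the idea in code, here is a rough sketch (using the official Kubernetes Python client, not the chart above) of a pod whose DNS is fully controlled via dnsPolicy: None plus dnsConfig, so that MPI peers can resolve each other by predictable names. The nameserver IP, search domains, headless-Service name, and image are all placeholders, not values taken from kube-openmpi:

# Minimal sketch: a pod with fully custom DNS so MPI peers resolve each other.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="mpi-worker-0", labels={"app": "mpi"}),
    spec=client.V1PodSpec(
        hostname="mpi-worker-0",
        subdomain="mpi-workers",          # headless Service name (assumed)
        dns_policy="None",
        dns_config=client.V1PodDNSConfig(
            nameservers=["10.96.0.10"],   # cluster DNS IP, placeholder
            searches=["mpi-workers.default.svc.cluster.local",
                      "default.svc.cluster.local",
                      "svc.cluster.local"],
            options=[client.V1PodDNSConfigOption(name="ndots", value="5")],
        ),
        containers=[client.V1Container(
            name="mpi",
            image="example/openmpi-app:latest",   # placeholder image
            command=["sleep", "infinity"],
        )],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)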
Assuming you don't want to use a hardware-specific MPI library (for example, anything that uses direct access to the communication fabric), I would go with option 2.
1. First, implement a wrapper for mpirun which populates the necessary data using the Kubernetes API, specifically using Endpoints if you use a Service (which might be a good idea); it could also scrape the pods' exposed ports directly. (A rough sketch of such a wrapper follows this list.)
2. Add some form of checkpoint program that can be used for "rendezvous" synchronization before starting the actual code (I don't know how well MPI deals with ephemeral nodes). This is to ensure that when mpirun starts, it has a stable set of pods to use.
3. Finally, build a container with the necessary code and, I guess, an SSH service for mpirun to use for starting processes in the other pods.
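To make step 1 concrete, here is a rough Python sketch of such an mpirun wrapper: it reads the pod IPs backing a Service from the Endpoints API, writes an OpenMPI hostfile, and then execs mpirun. The Service name, namespace, slots-per-pod, and binary path are assumptions, and it presumes password-less SSH (or some other launcher) between the pods is already in place:

# Rough sketch of an mpirun wrapper driven by the Kubernetes Endpoints API.
import os
import subprocess
from kubernetes import client, config

SERVICE = "mpi-workers"      # assumed Service name
NAMESPACE = "default"
SLOTS_PER_POD = 1

config.load_incluster_config()
endpoints = client.CoreV1Api().read_namespaced_endpoints(SERVICE, NAMESPACE)

ips = []
for subset in endpoints.subsets or []:
    for addr in subset.addresses or []:   # only ready addresses
        ips.append(addr.ip)

with open("/tmp/hostfile", "w") as f:
    for ip in ips:
        f.write(f"{ip} slots={SLOTS_PER_POD}\n")

# Assumes SSH between pods is already set up for mpirun's process launch.
os.execvp("mpirun", ["mpirun", "--hostfile", "/tmp/hostfile",
                     "-np", str(len(ips) * SLOTS_PER_POD),
                     "/app/my_mpi_binary"])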
Another interesting option would be to use StatefulSets, possibly even running Slurm inside, which implement a "virtual" cluster of MPI machines running on Kubernetes.
This provides stable hostnames for each node, which would reduce the problem of discovery and of keeping track of state. You could also use statefully-assigned storage for the containers' local work filesystem (which, with some work, could be made to always refer to the same local SSD, for example).
Another benefit is that it would probably be the least invasive to the actual application.
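As a small illustration of the stable-hostname point: with a StatefulSet (say "mpi") behind a headless Service (say "mpi-hs"), both names assumed here, the pod DNS names are predictable, so a hostfile can be generated without touching the API server at all:

# With stable StatefulSet hostnames, the hostfile is just a template.
REPLICAS = 4
NAMESPACE = "default"

with open("/tmp/hostfile", "w") as f:
    for i in range(REPLICAS):
        f.write(f"mpi-{i}.mpi-hs.{NAMESPACE}.svc.cluster.local slots=1\n")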
I am calling a shell script that does some processing from JCL using BPXBATCH like this:
//STEP2 EXEC PGM=BPXBATCH,
// PARM='SH PATHTOSCRIPT.SH MYARGUMENT'
The JCL has the service class with the highest priority. However, the shell script enters a queue waiting for resources. Sometimes it runs quickly, and other times it waits a long time for resources. The priority of the JCL seems to be independent of that of the shell script. I read that maybe using the "nice" command in Unix would increase the priority of the shell script.
I want to be sure, first, that the priority of a JCL job on z/OS doesn't affect the priority of the Unix process that was called from that JCL through BPXBATCH. I cannot find any documentation about this.
Short Answer
To answer your question first: BPXBATCH runs in one address space, and the shell runs in a second address space. Commands issued by the shell may run in the same address space as the shell, or in one or more additional address spaces.
The BPXBATCH address space has a service class, and the shell address space(s) have a service class, probably a different one. Each service class has its own performance goal, which tells the system how to manage that work.
Detailed Answer
The z/OS Workload Manager (WLM) is responsible for assigning work to a service class when new work is presented to it. Service classes specify performance goals and importance levels, not priorities. WLM manages all work in the system according to its performance goal, based on the importance of that goal.
There are a number of (workload management) subsystems that may start new work. Examples of such subsystems are:
JES, which manages batch work, i.e. batch jobs.
TSO, which manages interactive TSO user work (TSO login).
OMVS, which manages forked and non-locally spawned z/OS UNIX work.
STC, which manages started job workload.
This list is not complete; I listed only the subsystems that I need to answer the question.
When JES2/3 receives a job that is to run on the system, it presents some job attributes to WLM, and WLM assigns the job to a service class. It does so using the WLM classification rules for subsystem type JES and the attributes given.
Everything that runs in this job, i.e. in the job's address space, will be managed towards the performance goal of the assigned service class. This includes z/OS UNIX work that runs in this very address space, i.e. work that is not started via UNIX fork() or a non-local spawn().
When a z/OS UNIX process starts a new process via fork(), or via a non-local spawn(), this new work is handled by the WLM subsystem OMVS. The OMVS subsystem presents some attributes of the new process to WLM, and WLM assigns the process to a service class. It does so using the WLM classification rules for subsystem type OMVS and the attributes given. This kind of work always runs in a separate, new address space.
BPXBATCH starts the (first) UNIX command it is given via PARM= or //STDPARM as a new process, using either fork() or spawn(). The spawn() may be a local spawn() or a non-local spawn(); which one is used depends on many factors, too complex to explain here.
The important point is: when running BPXBATCH with PARM='SH ...', the shell process will always run in a separate, new address space and will be classified via the WLM subsystem OMVS.
The result is that BPXBATCH runs in one address space with its service class, and the shell runs in a second address space with its service class. The service classes may be the same, but usually they are different WLM definitions with different performance goals.
As a starter, have a look at z/OS MVS Planning: Workload Management
nice() on z/OS UNIX
nice() has no effect on z/OS UNIX unless the system has been set up to support it. There is a parameter, PRIORITYGOAL(...), in the BPXPRMxx parmlib member to set up a list of up to 40 WLM service classes that will be used in conjunction with nice(). I have never heard of anyone setting this parameter.
See z/OS MVS Initialization & Tuning Reference for details about the BPXPRMxx member.
I am designing a CorDapp, which would require user input as well as API integration, and I am considering various approaches to expose flows and vault queries to the outside world.
The default option seems to be to use Corda RPC. Unless I missed something, there are only Java bindings for it, which effectively restricts clients to being JVM-based. This is somewhat inconvenient, and ideally I would like something like OpenAPI to make it more open and implementation-agnostic.
Another option is to use some kind of Corda RPC to OpenAPI proxy. I know about Braid, and I'm sure there are others. Braid seems to support deployment as a Corda service packaged together with the flows into the CorDapp itself, effectively making it run embedded in the Corda node's JVM.
Braid can be deployed as a standalone proxy too, which I suppose is option three.
Instinctively I find the embedded mode more attractive, as it reduces the number of moving parts compared to a standalone mode. However, I am concerned that such a model may in fact become discouraged at some point, either because the Corda developers consider it a misuse of the services facility, or because some organisations will not be keen to deploy such code onto their nodes, especially when they may be running multiple CorDapps. I would imagine anything deployed as part of the Corda JVM would at least require more scrutiny due to its potential impact on everything else running there, which in turn would reduce agility.
I wonder which approach to integrating with a CorDapp is actually recommended?
Edit 1: I know it is technically possible to embed a webserver into the node and expose a REST API from there, at least in the current version of Corda (4.3 at the time of writing). The question is more about whether it is a good idea to do so, or not, and why.
Take a look at the question I asked on Stack Overflow regarding a front end for a CorDapp; it might be of some help.
Following is the link:
"Corda: Can we develop Dapps that will be run by IIS webserver to talk to Corda platform?"
You can use any front-end technology you want.
As of Corda 3, your backend must be JVM-based, for two reasons:

1. You need to load various flow, state and other class definitions onto the classpath to pass as arguments to flows, retrieve objects from the vault, etc.
2. You need to use the CordaRPCClient library to create an RPC connection to the node.

If you really need to write your back-end in another language, there are a few workarounds:

1. Create a thin Java webserver that sits between your main webserver and the node. The Java webserver translates HTTP requests from the main webserver into RPC calls to the node, and RPC responses from the node into HTTP responses to the main webserver. This is the approach taken by libraries such as Braid.
2. Use a library such as GraalVM to compile non-JVM languages to JVM bytecode. An example of writing a JVM webserver in JavaScript using GraalVM is available here: https://github.com/nitesh7sid/cordapp-example-nodejs-server-graalvm
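For completeness, the first workaround looks something like this from the non-JVM side. The proxy URL, path, and JSON shape below are purely hypothetical and depend on what your Braid instance or thin Java webserver actually exposes:

# Hypothetical non-JVM client calling a JVM-side proxy that wraps Corda RPC.
import json
import urllib.request

payload = json.dumps({
    "flow": "com.example.flows.IssueFlow",   # hypothetical flow name
    "args": {"amount": 100, "counterparty": "O=PartyB,L=London,C=GB"},
}).encode()

req = urllib.request.Request(
    "http://localhost:8080/api/flows/start",  # hypothetical proxy endpoint
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp))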
We're running Docker across two hosts, with overlay networking enabled and configured. It's version 1.12.1, with Consul as the KV store - but we aren't using Swarm, largely because we didn't feel it gave us the relevant control over ensuring availability and minimising resources, but anyway.
Our setup is micro service based, and we run quite a lot of containers which get restarted fairly frequently. Our model uses nginx as a "reverse proxy" for service discovery, for various reasons, and so we start multiple containers which share a --host of "nginx-lb". This works fine, and other containers on the network can connect to nginx-lb, which gives them a random one of the containers' IP addresses.
The problem we have is that when killing containers and creating new ones, sometimes (I don't know under what specific circumstances this occurs) the overlay network does not remove the old container from its records, and so other containers then try to connect to the dead ones, causing problems.
The only way to resolve this is then to manually run docker network disconnect -f overlay_net [container], having first run docker network inspect overlay_net to find the errant containers.
Is there a known issue with the overlay networking not removing dead containers from the KV data, or any ideas of a fix?
Yes, it's a known issue. Follow it here: https://github.com/docker/docker/issues/26244
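Until that issue is fixed, a small script can at least automate the manual workaround described in the question: it lists the containers the overlay network still thinks are attached, then force-disconnects whichever names you pass in. The network name here is an assumption:

# Helper for the manual disconnect workaround; adjust NETWORK to your setup.
import json
import subprocess
import sys

NETWORK = "overlay_net"

inspect = json.loads(subprocess.check_output(
    ["docker", "network", "inspect", NETWORK]))
attached = inspect[0].get("Containers") or {}

for container_id, info in attached.items():
    print(f"{info.get('Name')}\t{info.get('IPv4Address')}\t{container_id[:12]}")

# Pass the names (or IDs) of containers you have identified as dead.
for name in sys.argv[1:]:
    subprocess.run(["docker", "network", "disconnect", "-f", NETWORK, name])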
After successfully installing DevStack and launching instances, once I reboot the machine I have to start all over again and I lose all the instances that were launched. I tried rejoin-stack but it did not work. How can I get the instances back after a reboot?
You might set resume_guests_state_on_host_boot = True in nova.conf. The file should be located at /etc/nova/nova.conf
I've found some old discussion http://www.gossamer-threads.com/lists/openstack/dev/8772
AFAIK, at the present time OpenStack (Icehouse) is still not completely aware of the environments inside it, so it can't restore them completely after a reboot. The instances will be there (as virsh domains), but even if you start them manually or using nova flags, I'm not sure whether the other facilities will handle this correctly (e.g. whether Neutron will correctly configure all L3 rules according to its DB records, etc.). Honestly, I'm pretty sure they won't...
The answer depends on what you need to achieve:
If you need a template environment (e.g. a similar set of instances and networks each time after reboot), you may just script everything. In other words, make a bash script that creates everything you need and run it each time after stack.sh (a sketch of this idea appears after this answer). Make sure you start with a clean environment, since the OpenStack DB state persists between ./unstack.sh and ./stack.sh or ./rejoin-stack.sh (you might try to just clean the DB, or delete it; stack.sh will build it back).
If you need a persistent environment (e.g. you don't want to lose the VMs and the whole infrastructure state after a reboot), I'm not aware of a way to do this with OpenStack. For example, Neutron agents (which configure iptables, dhcp, etc.) do not save state and are driven by events from the Neutron service. They will not recover after a reboot, so the network will be dead. I'll be very glad if someone shares a method for doing such a recovery.
In general I think OpenStack is not focusing on this and will not focus on it during the nearest release cycles. The common approach is to have a multi-node environment where each node is replaceable.
See http://docs.openstack.org/high-availability-guide/content/ch-intro.html for reference
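Here is a sketch of the "template environment" idea mentioned above, written with the OpenStack Python SDK rather than a bash script. The cloud name, image, flavor, and network names are assumptions roughly matching a default DevStack and should be adapted to yours:

# Re-create a fixed set of instances after every stack.sh run.
import openstack

conn = openstack.connect(cloud="devstack")   # expects a clouds.yaml entry

image = conn.image.find_image("cirros-0.3.5-x86_64-disk")   # assumed image name
flavor = conn.compute.find_flavor("m1.tiny")
network = conn.network.find_network("private")

for name in ("vm-1", "vm-2"):
    server = conn.compute.create_server(
        name=name,
        image_id=image.id,
        flavor_id=flavor.id,
        networks=[{"uuid": network.id}],
    )
    conn.compute.wait_for_server(server)
    print(f"{name} is ACTIVE")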
DevStack is an ephemeral environment. It is not supposed to survive a reboot; this is not supported behavior.
That being said, you might find success in re-initializing the environment by running
./unstack.sh
followed by
./stack.sh
again.
Again, DevStack is an ephemeral environment. Its primary purpose is to run gate testing for OpenStack's CI infrastructure.
Or try ./rejoin-stack.sh to re-join the previous screens.
Is it possible to hot-plug an additional node (host) into a working OpenMPI app? We're talking about a production environment where we cannot afford even five seconds of downtime.
There are two scenarios I'm interested in:
We would just like to enhance the computing power by adding one more broadcast listener.
A node died, the master node handles it well and reassigns the task to somebody else. The system administrator comes in, restarts the dead node and plugs it back into the cluster.
Which platform independent MPI implementation would be best for the scenario above? OpenMPI is not a must here.
MPI-2 -- any implementation -- does allow dynamic processes, and in fact adding processes is currently much more feasible than removing them. You can use MPI_COMM_SPAWN to launch new processes running a given executable, and that returns an intercommunicator that can be used to communicate between the old (original) processes and the new ones.
The trick here is that nothing will automatically detect the new node. You'll have to have some process keeping an eye out for new nodes and spawning something on them. If the new nodes will just be listeners to the master node, that's probably the best case, as only the master node really needs to know about them. Ensuring that the spawn happens on the new node and not somewhere else is done through the info argument to spawn, and may be implementation-dependent.
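A minimal sketch of the spawn side, shown here with mpi4py: the hostname, worker script, and process count are placeholders, and exactly how placement via the info argument behaves is implementation-dependent ("host" is a reserved MPI info key, but support varies):

# Spawn extra worker processes onto a freshly added node (placement is a hint).
import sys
from mpi4py import MPI

info = MPI.Info.Create()
info.Set("host", "new-node.example.com")   # placeholder hostname for the new node

# Spawn two worker processes running worker.py; the returned communicator is an
# intercommunicator linking the original processes with the newly spawned ones.
intercomm = MPI.COMM_SELF.Spawn(sys.executable,
                                args=["worker.py"],
                                maxprocs=2,
                                info=info)

# Example of talking to the new ranks from the parent side of the intercomm.
intercomm.bcast({"task": "resume-listening"}, root=MPI.ROOT)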