OpenStack - mix pinned and unpinned VMs on the same host

Is it possible in OpenStack to have 2 VMs instantiated on the same host, where:
VM1 is instantiated from the "unpinned" 2-vCPU flavor (hw:cpu_policy not set)
VM2 is instantiated from the "pinned" 2-vCPU flavor (hw:cpu_policy=dedicated)
and be sure that VM2's pinned vCPUs (thus physical CPUs) will not be used by VM1?
When reading the 'CPU topologies' section in the OpenStack docs it says:
Caution: Host aggregates should be used to separate pinned instances
from unpinned instances as the latter will not respect the resourcing
requirements of the former.
so according to the above it looks like it's not possible. I would like to confirm that.
Because if you can't mix pinned and unpinned VMs on one host, that seems like a huge limitation, doesn't it? I'm asking in a telecom context, where pinning is often a must for some VMs (VNFCs) but not for others, and sometimes it's desirable to have them on the same host.

At the moment Nova does not support mixing pinned and unpinned instances on a single host (at least not out of the box), but it is a feature the developers are looking at implementing.
You can read about the current suggested implementation here.
Additional reading on work related to this blueprint is available here.
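For reference, the host-aggregate separation that the docs recommend looks roughly like this with the openstack CLI. This is a minimal sketch: the aggregate, host, and flavor names are made up, and it assumes the AggregateInstanceExtraSpecsFilter is enabled in the Nova scheduler.
```
# Aggregate for hosts reserved for pinned (dedicated-CPU) instances
openstack aggregate create --property pinned=true pinned-hosts
openstack aggregate add host pinned-hosts compute-0

# Aggregate for hosts running unpinned instances
openstack aggregate create --property pinned=false unpinned-hosts
openstack aggregate add host unpinned-hosts compute-1

# Tie the flavors to the aggregates via matching extra specs
openstack flavor set pinned-flavor \
  --property hw:cpu_policy=dedicated \
  --property aggregate_instance_extra_specs:pinned=true
openstack flavor set unpinned-flavor \
  --property aggregate_instance_extra_specs:pinned=false
```
With this in place, instances from the two flavors land on disjoint sets of hosts, which is exactly the separation the caution in the docs is asking for.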

Related

Typical resource request required for an nginx file explorer deployed on kubernetes

I have 2 NFS mounts of 100TB each, i.e. 200TB in total. I have mounted both of these on a Kubernetes container. My file server is a typical log server that holds a mix of data types like JSON, HTML, images, logs, text files, etc. The size of the files also varies a lot. I am kind of guessing what the ideal resource request for this Kubernetes container should be. My assumptions:
As this is file reads, it's an I/O-intensive operation, so CPU should be high.
Since we may have large files transferred over, memory should also be high.
I just wanted to check if my assumptions are right.
Posting this community wiki answer to set a baseline and to show one possible set of actions that should lead to a solution.
Feel free to edit and expand.
As I stated previously, this setup will depend heavily on the specific case, and giving an approximate figure could be misleading. In my opinion, the best course of action would be:
Install monitoring tools
Deploy the application for testing
Simulate the load
Install monitoring tools
There are a lot of monitoring tools that can retrieve data about the CPU and memory usage of your Pods. You will need to choose the one that best suits your workloads and infrastructure.
Some of them are:
Prometheus.io
Elastic.co
Datadoghq.com
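Whichever of those you pick, a quick built-in sanity check is also available with kubectl top. This assumes a metrics pipeline such as metrics-server is installed, which is an extra assumption on top of the tools listed above:
```
# Current CPU and memory usage per Pod (requires metrics-server or an equivalent metrics API)
kubectl top pod -n <namespace>
```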
Deploy the application for testing
This can also be quite a wide topic, considering that the exact requirements and infrastructure are not known. One of many questions is whether the Deployment should have a steady replica count or use some kind of Horizontal Pod Autoscaling (based on CPU and/or memory). The access modes on the storage shouldn't matter, as NFS supports RWX.
The basic implementation of the Deployment that could be used can be found in the official Kubernetes documentation:
Kubernetes.io: Docs: Concepts: Workloads: Controllers: Deployment: Creating a deployment
Kubernetes.io: Docs: Concepts: Storage: Volumes: NFS
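For illustration, a minimal Deployment along the lines of those docs might look like the sketch below. The NFS server address, image, replica count, and request/limit values are placeholders to be tuned with the data gathered in the monitoring step:
```
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-file-explorer
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx-file-explorer
  template:
    metadata:
      labels:
        app: nginx-file-explorer
    spec:
      containers:
      - name: nginx
        image: nginx:1.21
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: "500m"      # starting point only; adjust from monitoring data
            memory: "512Mi"
          limits:
            cpu: "1"
            memory: "1Gi"
        volumeMounts:
        - name: logs
          mountPath: /usr/share/nginx/html
          readOnly: true
      volumes:
      - name: logs
        nfs:
          server: nfs.example.com   # placeholder NFS server
          path: /logs               # placeholder export path
          readOnly: true
EOF
```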
Simulate the load
The simulation part could be done either through real-life usage or by using a tool to simulate the load. In this part, you would need to choose the option/tool that best suits your requirements. This part will show you the approximate resources that should be allocated to your nginx file explorer.
A side note!
In my testing I've used ab to check if the load was divided equally across X replicas.
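For example, a run along these lines (the URL and the request counts are made up):
```
# 10,000 requests total, 100 concurrent, against the Service exposing the file explorer
ab -n 10000 -c 100 http://<service-ip>/logs/example.json
```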
Additional resources
I do recommend checking the guide on managing resources in the official Kubernetes documentation:
Kubernetes.io: Docs: Concepts: Configuration: Manage resources containers
I also think that the VPA could help you in the whole process as:
Vertical Pod Autoscaler (VPA) frees the users from necessity of setting up-to-date resource limits and requests for the containers in their pods. When configured, it will set the requests automatically based on usage and thus allow proper scheduling onto nodes so that appropriate resource amount is available for each pod. It will also maintain ratios between limits and requests that were specified in initial containers configuration.
It can both down-scale pods that are over-requesting resources, and also up-scale pods that are under-requesting resources based on their usage over time.
-- Github.com: Kubernetes: Autoscaler: Vertical Pod Autoscaler
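If you go that route, a recommendation-only VPA object for the hypothetical Deployment sketched earlier might look roughly like this; updateMode "Off" only produces recommendations and does not evict pods:
```
kubectl apply -f - <<'EOF'
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: nginx-file-explorer
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-file-explorer   # name of the (hypothetical) Deployment above
  updatePolicy:
    updateMode: "Off"           # recommendation-only mode; inspect with: kubectl describe vpa nginx-file-explorer
EOF
```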
I reckon you could also look at this answer:
Stackoverflow.com: Answers: PromQL query to find CPU and memory used for the last week

Do all servers have one base OS, like in the Red Hat OpenStack architecture?

I'm a noob learning OpenStack, and the resources are all over the place, to be honest. I came across this image and would like to know one thing.
So, suppose I have 100TB of storage, 10 server-grade processors, and 1TB of RAM. Do all these resources run under only one base OS, Red Hat Enterprise Linux? So, do they sell the resources to connect all the equipment and install one single OS that can manage it all?
And on top of this, we throw an OpenStack architecture so clients can use the resources as needed? Do we need as many physical NICs, or are the NICs virtual?
How to scale?
As you say, you just add a server. Install RHEL or another supported Linux distro (it's best to install the same distro and version on all servers), then install and configure OpenStack. The new server will register with the OpenStack controllers and can be used for launching virtual machines immediately.
The process is a bit more involved when you run a cloud with baremetal instances (i.e. you don't launch VMs but provision physical systems), but in principle it's the same.
by definition (at consumer scale, like one laptop) we need a network interface card for one IP
This is incorrect. You can configure multiple IP addresses on a single interface, even on your PC at home, even if that PC runs Windows.
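For example, on a Linux box (the interface name and addresses are just placeholders):
```
# Two IPv4 addresses on the same physical interface
ip addr add 192.0.2.10/24 dev eth0
ip addr add 192.0.2.11/24 dev eth0
ip addr show eth0   # lists all addresses configured on this one NIC
```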
An enterprise cloud requires connecting nodes to several networks. Usually, servers have several physical NICs, bond them together, and use VLANs or other multiplexing technologies to implement the networks. See this blog (five years old, but the principles still apply today, and it's well-written) for a good example of a real-world OpenStack network architecture.
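As a rough illustration of that pattern with iproute2 (interface names and VLAN IDs are made up; a real deployment would configure this persistently through the distro's network tooling):
```
# Two NICs bonded together, with VLAN sub-interfaces carrying separate networks
ip link add bond0 type bond mode 802.3ad
ip link set eth0 down && ip link set eth0 master bond0
ip link set eth1 down && ip link set eth1 master bond0
ip link set bond0 up
ip link add link bond0 name bond0.100 type vlan id 100   # e.g. management network
ip link add link bond0 name bond0.200 type vlan id 200   # e.g. tenant/overlay network
```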
Openstack uses one big special NIC
OpenStack can be deployed in many ways. It is not a shrink-wrapped solution. It can be used on servers with single NICs, bonded NICs, VLANs, normal networks, etc. Your statement is almost correct if you think of a typical deployment and a bond interface as a "big special NIC".
If you are interested in trying this out at home, see the OpenStack installation tutorial. You will learn a lot.

Migrate from legacy network in GCE

Long story short - I need to use networking between projects to have separate billing for them.
I'd like to reach all the VMs in different projects from a single point that I will use for provisioning systems (let's call it coordinator node).
It looks like VPC network peering is a perfect solution for this. But unfortunately, one of the existing networks is "legacy". Here's what the Google documentation says about legacy networks.
About legacy networks
Note: Legacy networks are not recommended. Many newer GCP features are not supported in legacy networks.
OK, naturally the question arises: how do you migrate out of a legacy network? The documentation does not address this topic. Is it not possible?
I have a bunch of VMs, and I'd be able to shut them down one by one:
shutdown
change something
restart
Unfortunately, it does not seem possible to change the network even when the VM is down.
EDIT:
It has been suggested to recreate the VMs, keeping the same disks. I would still need a way to bridge the legacy network with the new VPC network to make the migration smooth. Any thoughts on how to do that using the GCE toolset?
One possible solution - for each VM in the legacy network:
Get VM parameters (API get method)
Delete VM without deleting PD (persistent disk)
Create VM in the new VPC network using parameters from step 1 (and existing persistent disk)
This way, stop-change-start is not so different from delete-recreate-with-changes. It's possible to write a script to fully automate this (migration of a whole network). I wouldn't be surprised if someone has already done that.
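A rough sketch of those steps with gcloud; the instance, zone, network, subnet, and disk names are placeholders, and disk auto-delete settings should be double-checked before deleting anything:
```
# 1. Save the VM's current configuration for reference
gcloud compute instances describe my-vm --zone=us-central1-a --format=yaml > my-vm.yaml

# 2. Make sure the boot disk is not deleted together with the VM, then delete the VM
gcloud compute instances set-disk-auto-delete my-vm --zone=us-central1-a \
  --disk=my-vm --no-auto-delete
gcloud compute instances delete my-vm --zone=us-central1-a --keep-disks=all

# 3. Recreate the VM in the new VPC network, reattaching the existing persistent disk
gcloud compute instances create my-vm --zone=us-central1-a \
  --network=new-vpc --subnet=new-subnet \
  --disk=name=my-vm,boot=yes
```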
UPDATE
The https://github.com/googleinterns/vm-network-migration tool automates the above process, and it also supports migration of a whole Instance Group, Load Balancer, etc. Check it out.

What is the difference between a cold and hot reboot in OpenStack?

I am new to OpenStack for virtualization.
I can reboot an instance in two ways: cold and hard reboot.
I can understand the difference on a physical computer, but what is the difference between a cold and a hot reboot on a VM?
Thanks
Apart from the documentation here, which has already been mentioned in this thread:
http://docs.openstack.org/user-guide/cli-reboot-an-instance.html
A hard reboot also affects the virtual machine at the hypervisor level. Example: if you are using libvirt-based hypervisors (qemu/kvm), the instance control file (the libvirt XML representing the virtual machine in libvirt) gets reconstructed from scratch.
That's very useful when the instance storage space (/var/lib/nova/instances/INSTANCE_UUID) suffers any kind of problem for whatever reason, or, in general, whenever you need OpenStack to reconstruct the libvirt definitions.
It affects both the libvirt XML definition normally stored at /etc/libvirt/qemu and the copy at /var/lib/nova/instances/INSTANCE_UUID.
So, in summary: use a hard reboot if you need to fully reset/reboot the instance up to the hypervisor level. As you can see, it is more like a "power cycle on steroids".
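For reference, both variants are available from the CLI (the instance name is a placeholder):
```
openstack server reboot my-instance          # soft reboot: asks the guest OS to restart itself
openstack server reboot --hard my-instance   # hard reboot: power-cycles the VM at the hypervisor level
```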
Hope this helps!

Kubernetes and MPI

I want to run an MPI job on my Kubernetes cluster. The context is that I'm actually running a modern, nicely containerised app but part of the workload is a legacy MPI job which isn't going to be re-written anytime soon, and I'd like to fit it into a kubernetes "worldview" as much as possible.
One initial question: has anyone had any success in running MPI jobs on a kube cluster? I've seen Christian Kniep's work in getting MPI jobs to run in docker containers, but he's going down the docker swarm path (with peer discovery using consul running in each container) and I want to stick to kubernetes (which already knows the info of all the peers) and inject this information into the container from the outside. I do have full control over all the parts of the application, e.g. I can choose which MPI implementation to use.
I have a couple of ideas about how to proceed:
fat containers containing Slurm and the application code -> populate the slurm.conf with appropriate info about the peers at container startup -> use srun as the container entrypoint to start the jobs
slimmer containers with only OpenMPI (no Slurm) -> populate a rankfile in the container with info from outside (provided by Kubernetes) -> use mpirun as the container entrypoint
an even slimmer approach, where I basically "fake" the MPI runtime by setting a few environment variables (e.g. the OpenMPI ORTE ones) -> run the mpicc'd binary directly (where it'll find out about its peers through the env vars)
some other option
give up in despair
I know trying to mix "established" workflows like MPI with the "new hotness" of kubernetes and containers is a bit of an impedance mismatch, but I'm just looking for pointers/gotchas before I go too far down the wrong path. If nothing exists I'm happy to hack on some stuff and push it back upstream.
I tried MPI jobs on Kubernetes for a few days and solved it by using dnsPolicy: None and dnsConfig (the CustomDNS=true feature gate is needed).
I pushed my manifests (as a Helm chart) here:
https://github.com/everpeace/kube-openmpi
I hope it helps.
Assuming you don't want to use a hardware-specific MPI library (for example, anything that uses direct access to the communication fabric), I would go with option 2.
First, implement a wrapper for mpirun which populates the necessary data using the Kubernetes API, specifically using Endpoints if using a Service (might be a good idea); it could also scrape the pods' exposed ports directly (see the sketch after this list).
Add some form of checkpoint program that can be used for "rendezvous" synchronization before starting the actual code (I don't know how well MPI deals with ephemeral nodes). This is to ensure that when mpirun starts it has a stable set of pods to use.
And finally, actually build a container with the necessary code and, I guess, an SSH service for mpirun to use for starting processes in the other pods.
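A very rough sketch of that wrapper idea, assuming a headless Service named mpi-workers and kubectl available inside the container (all names and paths are made up):
```
# Build a hostfile from the Endpoints of the (hypothetical) headless Service
kubectl get endpoints mpi-workers \
  -o jsonpath='{range .subsets[*].addresses[*]}{.ip}{"\n"}{end}' > /tmp/hostfile

# Launch the job across those pods; slot counts would be derived from the pods' CPU requests
mpirun --hostfile /tmp/hostfile -np "$(wc -l < /tmp/hostfile)" /opt/app/mpi-binary
```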
Another interesting option would be to use StatefulSets, possibly even running with Slurm inside, to implement a "virtual" cluster of MPI machines running on Kubernetes.
This provides stable hostnames for each node, which reduces the problem of discovery and of keeping track of state. You could also use statefully-assigned storage for the container's local work filesystem (which, with some work, could be made, for example, to always refer to the same local SSD).
Another benefit is that it would probably be the least invasive to the actual application.
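For instance, with a StatefulSet named mpi-node governed by a headless Service named mpi (both names are made up), each replica gets a stable DNS name of the form mpi-node-N.mpi.default.svc.cluster.local, so a hostfile can be generated without any discovery step:
```
# Generate a static hostfile from the predictable StatefulSet pod names
for i in 0 1 2 3; do
  echo "mpi-node-$i.mpi.default.svc.cluster.local slots=1"
done > /tmp/hostfile
mpirun --hostfile /tmp/hostfile /opt/app/mpi-binary
```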
