MPI Operator TensorFlow benchmark example not starting

I'm trying to run this MPIJob example, https://github.com/kubeflow/mpi-operator/blob/master/examples/v2beta1/tensorflow-benchmarks/tensorflow-benchmarks.yaml, by following the steps in this readme. I deployed the configuration to a local k3s cluster, but the launcher pod is failing with the error: ssh: Could not resolve hostname tensorflow-benchmarks-worker-0.tensorflow-benchmarks-worker: Name or service not known
How can I resolve this problem?

Related

OpenStack-Devstack: Can't create instances using KVM on host

I have a Dockerized all-in-one DevStack installation on Ubuntu 20.04. My goal is to connect to the host's KVM and create instances there. Nova was configured as follows for this purpose.
#/etc/nova/nova.conf
#/etc/nova/nova-cpu.conf
[libvirt]
connection_uri = qemu+ssh://root@172.10.1.1/system
When I try to build the instance, I get the following error.
Build of instance cdd6f8b4-6dcf-4a43-b96a-fb6166b20235 aborted: Failed to allocate the network(s), not rescheduling.
ovs-vsctl commands cause the error. What is the problem? Does this need to be done differently?

AWS credentials not found for celery-k8s deployment

I'm trying to run Dagster using celery-k8s, with examples/celery-k8s as a starting point. When I run the pipeline from the playground, I get:
Initialization of resources [s3, io_manager] failed.
botocore.exceptions.NoCredentialsError: Unable to locate credentials
I have configured the AWS credentials in environment variables as mentioned in the documentation:
deployments:
  - name: "user-code-deployment-test"
    image:
      repository: "somasays/dagster-usercode-example"
      tag: "0.5"
      pullPolicy: Always
    dagsterApiGrpcArgs:
      - "-f"
      - "/workspace/repo.py"
    port: 3030
    env:
      AWS_ACCESS_KEY_ID: AAAAAAAAAAAAAAAAAAAAAAAAA
      AWS_SECRET_ACCESS_KEY: qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
      AWS_DEFAULT_REGION: eu-central-1
I can see that these values are set in the environment variables of the pod, and I can access the S3 location from it after running pip install awscli and aws s3 ls. The job pod, however, throws Unable to locate credentials.
Please help
The deployment configuration applies to the user code servers. Meanwhile the celery executor runs your pipeline code in separate kubernetes jobs. To provide your secrets there, you will want to configure the env_secrets field of the celery-k8s executor in your pipeline run config.
See https://github.com/dagster-io/dagster/blob/master/python_modules/libraries/dagster-k8s/dagster_k8s/job.py#L321-L327 for details on the config.
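As a rough illustration of the answer above, the execution section of the run config could be built as in the sketch below. The Secret name aws-credentials is hypothetical (it would be a Kubernetes Secret you create yourself with the three AWS variables), and the execution / celery-k8s / config nesting is assumed from the legacy pipeline run config format, so check it against the config schema of your Dagster version.

```python
import json

# Hypothetical name of a Kubernetes Secret that holds AWS_ACCESS_KEY_ID,
# AWS_SECRET_ACCESS_KEY and AWS_DEFAULT_REGION. It is not defined anywhere
# in the question; you would create it in the namespace where the jobs run.
AWS_SECRET_NAME = "aws-credentials"


def build_run_config(secret_names):
    """Return the execution section of a run config for the celery-k8s executor.

    The nesting used here is an assumption based on the legacy pipeline
    run config layout; only the env_secrets field comes from the answer.
    """
    return {
        "execution": {
            "celery-k8s": {
                "config": {
                    # Names of Kubernetes Secrets whose keys are exposed as
                    # environment variables in each job pod.
                    "env_secrets": list(secret_names),
                }
            }
        }
    }


run_config = build_run_config([AWS_SECRET_NAME])
print(json.dumps(run_config, indent=2))
```

The printed structure can be merged into the run config entered in the playground, next to the existing resources and solids sections.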

Error on configuring Devstack compute nodes: Service n-net is not running

While installing DevStack on a compute node in a multi-node DevStack lab environment, I encountered the error: Service n-net is not running.
The local.conf file has localrc as:
HOST_IP=192.168.42.12 # change this per compute node
FLAT_INTERFACE=eth0
FIXED_RANGE=10.4.128.0/20
FIXED_NETWORK_SIZE=4096
FLOATING_RANGE=192.168.42.128/25
MULTI_HOST=1
LOGFILE=/opt/stack/logs/stack.sh.log
ADMIN_PASSWORD=labstack
DATABASE_PASSWORD=supersecret
RABBIT_PASSWORD=supersecret
SERVICE_PASSWORD=supersecret
DATABASE_TYPE=mysql
SERVICE_HOST=192.168.42.11
MYSQL_HOST=$SERVICE_HOST
RABBIT_HOST=$SERVICE_HOST
GLANCE_HOSTPORT=$SERVICE_HOST:9292
ENABLED_SERVICES=n-cpu,n-net,n-api-meta,c-vol
NOVA_VNC_ENABLED=True
NOVNCPROXY_URL="http://$SERVICE_HOST:6080/vnc_auto.html"
VNCSERVER_LISTEN=$HOST_IP
VNCSERVER_PROXYCLIENT_ADDRESS=$VNCSERVER_LISTEN
Please help me resolve this error.
P.S: I must use nova-net and not neutron for interaction between the controller and the compute nodes.
For the Ocata release I found a solution (2-node setup). An important part is the placement API, which has been required since the Newton update (14.0.0), so first of all enable it on all your nodes:
local.conf:
enable_service placement-api
First run ./stack.sh on your controller node and after that installation run it on the other nodes.
Here, too, you will see the error Service n-net is not running...
Now edit your nova.conf file in /etc/nova/nova.conf, because it will have no [database] or [api_database] section. Add them:
[database]
connection=mysql+pymysql://root:DB_PASS@IP_OF_CONTROLLER_NODE/nova
[api_database]
connection=mysql+pymysql://root:DB_PASS@IP_OF_CONTROLLER_NODE/nova_api
After adding these, you can check whether it works with the following command:
stack@jerico-02:/devstack$ nova-manage --debug host list
host zone
0.0.0.0 internal
jerico-03 internal
jerico-02 nova
Also in the dashboard the new compute (hypervisor) shows up!
Hope it helps!
(Tested on Ubuntu Server 16.04 LTS with devstack and OpenStack Ocata)

Error occurred when bootstrapping Cloudify 3.4

I installed Cloudify 3.4 according to the Cloudify docs. When installing the manager, I ran:
# cfy bootstrap --install-plugins -p openstack-manager-blueprint.yaml -i openstack-manager-blueprint-inputs.yaml
an error occurred:
[ERROR] Workflow failed: Task failed 'fabric_plugin.tasks.run_script' -> Timed out trying to connect to 192.168.17.15 (tried 5 times)
I have already built an external network, 192.168.17.0/24, and I have already installed
cloudify_docker_plugin-1.3.2-py27-none-linux_x86_64-Ubuntu-trusty.wgn
cloudify_fabric_plugin-1.4.1-py27-none-linux_x86_64-centos-Core.wgn
cloudify_fabric_plugin-1.4.1-py27-none-linux_x86_64-redhat-Maipo.wgn
cloudify_host_pool_plugin-1.4-py27-none-linux_x86_64-centos-Core.wgn
cloudify_openstack_plugin-1.4-py27-none-linux_x86_64-redhat-Maipo.wgn
How can I solve this error? Thank you to everyone who helps!
It seems that you can't connect to the manager.
Please make sure that you have an ssh connection from the CLI to the manager.
Since you are bootstrapping an OpenStack manager, you should make sure that you have an external IP if you are outside of OpenStack, or that the CLI is on the same network if you are inside OpenStack.
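A quick way to test the connectivity described above is to check from the CLI machine whether the manager's SSH port accepts TCP connections. The sketch below is only a reachability probe using the Python standard library; the function name can_connect is made up for this example, and port 22 is assumed to be the SSH port. A successful connection does not prove that key-based login will work, only that the network path is open.

```python
import socket


def can_connect(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds.

    This only tests network reachability of the port (22 is assumed to be
    SSH here); it does not attempt an SSH handshake or authentication.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

For example, calling can_connect("192.168.17.15") on the machine where cfy bootstrap is run should return True before the bootstrap can get past the fabric step. If it returns False, look at the security group rules, the floating IP assignment and the routing to the 192.168.17.0/24 network.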

Bad Request Error OpenStack

I am trying to create an instance from the command line using the command:
nova boot --config-drive=true --flavor 2 --key-name key1 --image c28bc1e8-a25f-413c-9e13-fecdd5d6f522 instance1
But I got this error:
ERROR (BadRequest): Network 00000000-0000-0000-0000-000000000000,
11111111-1111-1111-1111-111111111111 could not be found. (HTTP 400)
(Request-ID: req-6dd0352e-008a-40c4-91e2-454529712ba9)
How can I resolve this problem?
I’m guessing you may have the rax_default_network_flags_python_novaclient_ext Python package installed, which automatically adds those networks to the request, but are not booting an instance in the Rackspace public cloud.
This can likely be resolved using the --no-service-net and --no-public arguments, or by uninstalling the above mentioned Python module.
