Error executing the install workflow in Cloudify

I'm trying to run a workflow in cloudify, but when running the command:
cfy executions start -w install -d teste003 --debug --include-logs
I get the following error:
Execution of workflow 'install' for deployment 'teste003' timed out. * Run 'cfy executions cancel --execution-id c12ac2b2-fd34-4a04-a4bc-252871f9e166' to cancel the running workflow.
* Run 'cfy events list --tail --include-logs --execution-id c12ac2b2-fd34-4a04-a4bc-252871f9e166' to retrieve the execution's events/logs
Traceback (most recent call last):
File "/home/ubuntu/cloudify/bin/cfy", line 9, in <module>
load_entry_point('cloudify==3.2.1', 'console_scripts', 'cfy')()
File "/home/ubuntu/cloudify/local/lib/python2.7/site-packages/cloudify_cli/cli.py", line 37, in main
args.handler(args)
File "/home/ubuntu/cloudify/local/lib/python2.7/site-packages/cloudify_cli/cli.py", line 143, in command_cmd_handler
command['handler'](**kwargs)
File "/home/ubuntu/cloudify/local/lib/python2.7/site-packages/cloudify_cli/commands/executions.py", line 174, in start
raise SuppressedCloudifyCliError()
SuppressedCloudifyCliError
Below is my aws-ec2-blueprint.yaml file:
tosca_definitions_version: cloudify_dsl_1_1
imports:
  - http://www.getcloudify.org/spec/cloudify/3.2.1/types.yaml
  - http://www.getcloudify.org/spec/aws-plugin/1.2.1/plugin.yaml
  - http://www.getcloudify.org/spec/diamond-plugin/1.2.1/plugin.yaml
inputs:
  image:
    description: >
      Image to be used when launching agent VMs
  size:
    description: >
      Flavor of the agent VMs
  agent_user:
    description: >
      User for connecting to agent VMs
node_templates:
  mongod_host:
    type: cloudify.aws.nodes.Instance
    properties:
      image_id: { get_input: image }
      instance_type: { get_input: size }
My inputs.yaml:
image: ami-d05e75b8
size: m3.medium
agent_user: ubuntu
Any suggestions?

It is hard to know why the install timed out without seeing the logs of the install.
In most cases it is related to the connection to the spawned VM or an install process that keeps failing.
I would try to check:
AWS permissions to spawn a VM and connect to it
Security groups, port 22 open to the manager VM
Internet access of the spawned VM
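For the security-group check in particular, here is a minimal sketch of what an SSH rule and its attachment to the instance could look like in the blueprint. This is illustrative only: the node type, relationship name and rule keys below are what I recall from the cloudify-aws-plugin docs and may differ in plugin version 1.2.1, so verify them against the plugin.yaml you import.
  # hypothetical addition - not part of the original blueprint
  mongod_security_group:
    type: cloudify.aws.nodes.SecurityGroup
    properties:
      description: Allow SSH so the manager can install the agent
      rules:
        - ip_protocol: tcp
          from_port: 22
          to_port: 22
          cidr_ip: 0.0.0.0/0   # tighten this to the manager's address in practice
  # and on mongod_host:
  #   relationships:
  #     - type: cloudify.aws.relationships.instance_connected_to_security_group
  #       target: mongod_security_group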

Related

failure to install airflow helm chart on GKE due to failing migration

I have been trying to set up Airflow using the official airflow helm chart from ArtifactHub on a GKE cluster, but I am running into a couple of issues.
First I get these errors from the pods:
Failed to load logs: container "scheduler" in pod "airflow-scheduler-6b6cc9db4-qmvbw" is waiting to start: PodInitializing
Reason: BadRequest (400)
Then from the init containers, I get the following error:
Traceback (most recent call last):
File "/home/airflow/.local/bin/airflow", line 8, in <module>
sys.exit(main())
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/__main__.py", line 39, in main
args.func(args)
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/cli/cli_parser.py", line 52, in command
return func(*args, **kwargs)
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/cli/commands/db_command.py", line 138, in check_migrations
db.check_migrations(timeout=args.migration_wait_timeout)
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/utils/db.py", line 739, in check_migrations
f"There are still unapplied migrations after {timeout} seconds. Migration"
TimeoutError: There are still unapplied migrations after 60 seconds. MigrationHead(s) in DB: set() | Migration Head(s) in Source Code: {'ecb43d2a1842'}
The postgres pods look okay.
The second issue is that I get an error that the pods cannot be scheduled because no nodes are available, and the cluster's autoscaling does not help: even after it scales, the nodes still will not take the new pods.
The challenge with the above is that I upgraded the nodes to 32 processors and 64 GB of memory, but the error persists, so I assume it is something else.
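A debugging sketch for the migration error: the init containers only wait for migrations to be applied, so it is worth checking whether the chart's database-migration job actually ran and succeeded. The names below (namespace airflow, job airflow-run-airflow-migrations, deployment airflow-webserver) are the chart defaults for a release called airflow and are assumptions; adjust them to your release.
# check whether the migration job exists and completed
kubectl get jobs -n airflow
kubectl logs job/airflow-run-airflow-migrations -n airflow
# if it never ran or failed, applying the migrations manually from an Airflow pod
# should unblock the init containers (Airflow 2.x; only works if that pod is actually up)
kubectl exec -n airflow deploy/airflow-webserver -- airflow db upgrade
# for the scheduling issue, the Events section usually states exactly why a pod cannot be placed
kubectl describe pod airflow-scheduler-6b6cc9db4-qmvbw -n airflow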

Odd error when attempting net_put via Ansible

Looking for assistance with an odd error I am troubleshooting with a playbook.
I have a working SSH session to a switch, but I am having difficulty transferring files via SCP through Ansible. I can start an SCP session directly from the same server with no issues and can transfer a text file (the same one referenced below), but it does not seem to work in Ansible.
I enabled verbose logging via Ansible and this is what I am seeing in the logfile generated.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/ansible/utils/jsonrpc.py", line 46, in handle_request
result = rpc_method(*args, **kwargs)
File "/root/.ansible/collections/ansible_collections/ansible/netcommon/plugins/connection/network_cli.py", line 1282, in copy_file
self.ssh_type_conn.put_file(source, destination, proto=proto)
File "/root/.ansible/collections/ansible_collections/ansible/netcommon/plugins/connection/libssh.py", line 498, in put_file
raise AnsibleError(
ansible.errors.AnsibleError: Error transferring file to flash:test.txt: Initializing SCP session of remote file [flash:test.txt] for w>
2022-10-06 11:58:35,671 p=535932 u=root n=ansible | fatal: [%remoteSwitch%]: FAILED! => {
"changed": false,
"destination": "flash:test.txt",
"msg": "Exception received: Error transferring file to flash:test.txt: Initializing SCP session of remote file [flash:test.txt] fo>
}
Afraid Google is not helping me too much with this one. If it helps, this is on Ubuntu 22.04, with Ansible 2.10.8.
The play I am attempting to run is:
- hosts: %remoteSwitch%
  vars:
    - firmware_image_name: "test.txt"
  tasks:
    - name: Copying image to the switch... This can take time, please wait...
      net_put:
        src: "/etc/ansible/firmware_images/C2960X/{{ firmware_image_name }}"
        dest: "flash:{{ firmware_image_name }}"
      vars:
        ansible_command_timeout: 20
        protocol: scp
I ran into this same error. I was able to fix it by switching back to Paramiko SSH. One way to do this is to pip uninstall ansible-pylibssh (note that this very likely has other side effects).
Alternatively, you can force Paramiko usage at the Ansible play level:
---
- name: Test putting a file onto Cisco IOS/IOS-XE device
  hosts: cisco1
  # ansible-pylibssh errors out here (force paramiko usage)
  vars:
    ansible_network_cli_ssh_type: paramiko
  tasks:
    - name: Copy file
      ansible.netcommon.net_put:
        src: my_file1.txt
        dest: flash:/my_file1.txt
        protocol: scp
It would be helpful to know what type of connection it is and what platform.
I see from the file that it is a Cisco IOS device. Do you have the following settings?
ansible_connection: ansible.netcommon.network_cli
ansible_network_os: cisco.ios.ios
The following documentation mentions the need for paramiko. Will it work if you change the ssh_type to paramiko?
https://docs.ansible.com/ansible/latest/collections/ansible/netcommon/net_put_module.html
ssh_type can be set as follows:
INI entry (in ansible.cfg):
[persistent_connection]
ssh_type = paramiko
Environment variable: ANSIBLE_NETWORK_CLI_SSH_TYPE
Variable: ansible_network_cli_ssh_type
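Putting that together, the relevant group or host variables could look roughly like this (the file name and group are just examples; the variable names come from the settings above):
# group_vars/cisco.yml (illustrative)
ansible_connection: ansible.netcommon.network_cli
ansible_network_os: cisco.ios.ios
ansible_network_cli_ssh_type: paramiko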

Airflow live executor logs with DaskExecutor

I have an Airflow installation (on Kubernetes). My setup uses DaskExecutor. I also configured remote logging to S3. However, when a task is running I cannot see its log, and I get this error instead:
*** Log file does not exist: /airflow/logs/dbt/run_dbt/2018-11-01T06:00:00+00:00/3.log
*** Fetching from: http://airflow-worker-74d75ccd98-6g9h5:8793/log/dbt/run_dbt/2018-11-01T06:00:00+00:00/3.log
*** Failed to fetch log file from worker. HTTPConnectionPool(host='airflow-worker-74d75ccd98-6g9h5', port=8793): Max retries exceeded with url: /log/dbt/run_dbt/2018-11-01T06:00:00+00:00/3.log (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7d0668ae80>: Failed to establish a new connection: [Errno -2] Name or service not known',))
Once the task is done, the log is shown correctly.
I believe what Airflow is doing is:
- for finished tasks, read the logs from S3
- for running tasks, connect to the executor's log server endpoint and show that.
It looks like Airflow is using celery.worker_log_server_port to connect to my Dask executor to fetch logs from there.
How to configure DaskExecutor to expose log server endpoint?
My configuration:
[core]
remote_logging = True
remote_base_log_folder = s3://some-s3-path
executor = DaskExecutor
[dask]
cluster_address = 127.0.0.1:8786
[celery]
worker_log_server_port = 8793
What I verified:
- verified that the log file exists and is being written to on the executor while the task is running
- called netstat -tunlp on executor container, but did not find any extra port exposed, where logs could be served from.
UPDATE
Have a look at the serve_logs Airflow CLI command - I believe it does exactly the same thing.
We solved the problem by simply starting a Python HTTP server on the worker: the webserver fetches running-task logs from http://<worker>:8793/log/..., so serving the worker's log directory under a log/ path on port 8793 is enough.
Dockerfile:
RUN mkdir -p $AIRFLOW_HOME/serve
RUN ln -s $AIRFLOW_HOME/logs $AIRFLOW_HOME/serve/log
worker.sh (run by Docker CMD):
#!/usr/bin/env bash
cd $AIRFLOW_HOME/serve
python3 -m http.server 8793 &
cd -
dask-worker "$@"

SSH connectivity issues with ntc-ansible modules

I am trying to use the ntc-ansible modules with Ansible running on Ubuntu (WSL). I have SSH connectivity to my remote device (a Cisco 2960X), and I can run Ansible playbooks against the same remote switch using the built-in Ansible networking modules (ios_command) and it works fine.
Issue:
When I try to run any of the ntc-ansible modules, it fails, unable to connect to the device. Probably something simple, but I have hit a wall; there is something I am missing about how to use the ntc-ansible modules. Ansible is seeing the modules, since I can view their docs, which was suggested as a test in the README.
I have ntc-ansible module installed here: /home/melshman/.ansible/plugins/modules/ntc-ansible
I am running my playbooks from here: ~/projects/ansible/
The first time I ran the playbook with the ntc-ansible modules it failed, and based on the error message and some research I installed sshpass (sudo apt-get install sshpass). But I am still having SSH problems using ntc-ansible… (playbook and traceback below)
I hear folks talking about an index file, but I can’t find that file. Where does it live and what do I need to do with it?
What is my connection supposed to be setup to be? Local? SSH? Netmiko_ssh?
What should I be using for platform? Cisco_ios? cisco_ios_ssh?
Appreciate any help I can get. I have been running in circles for hours and hours.
Ansible Version Info:
VTMNB17024:~/projects/ansible $ ansible --version
ansible 2.5.3
config file = /home/melshman/projects/ansible/ansible.cfg
configured module search path = [u'/home/melshman/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
ansible python module location = /usr/local/lib/python2.7/dist-packages/ansible
executable location = /usr/local/bin/ansible
python version = 2.7.12 (default, Dec 4 2017, 14:50:18) [GCC 5.4.0 20160609]
Working playbook (ios_command) - note: ansible_ssh_pass and ansible_user are in group vars:
- name: Test Net Automation
  hosts: ctil-ios-upgrade
  connection: local
  gather_facts: no
  tasks:
    - name: Grab run config
      ios_command:
        commands:
          - show run
      register: config
    - name: Create backup of running configuration
      copy:
        content: "{{config.stdout[0]}}"
        dest: "backups/show_run_{{inventory_hostname}}.txt"
Playbook (not working) using an ntc-ansible module (note: username and password are defined in group vars):
- name: Cisco IOS Automation
  hosts: ctil-ios-upgrade
  connection: local
  gather_facts: no
  tasks:
    - name: GET UPTIME
      ntc_show_command:
        connection: ssh
        platform: "cisco_ios"
        command: 'show version | inc uptime'
        host: "{{ inventory_hostname }}"
        username: "{{ username }}"
        password: "{{ password }}"
        use_templates: True
        template_dir: /home/melshman/.ansible/plugins/modules/ntc-ansible/ntc-templates/templates
Here is the traceback I get when the error occurs:
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: netmiko.ssh_exception.NetMikoTimeoutException: Connection to device timed-out: cisco_ios VTgroup_SW:22
fatal: [VTgroup_SW]: FAILED! => {"changed": false, "module_stderr": "Traceback (most recent call last):\n File \"/tmp/ansible_RJRY9m/ansible_module_ntc_save_config.py\", line 279, in \n main()\n File \"/tmp/ansible_RJRY9m/ansible_module_ntc_save_config.py\", line 251, in main\n device = ntc_device(device_type, host, username, password, **kwargs)\n File \"/usr/local/lib/python2.7/dist-packages/pyntc-0.0.6-py2.7.egg/pyntc/__init__.py\", line 35, in ntc_device\n return device_class(*args, **kwargs)\n File \"/usr/local/lib/python2.7/dist-packages/pyntc-0.0.6-py2.7.egg/pyntc/devices/ios_device.py\", line 39, in __init__\n self.open()\n File \"/usr/local/lib/python2.7/dist-packages/pyntc-0.0.6-py2.7.egg/pyntc/devices/ios_device.py\", line 55, in open\n verbose=False)\n File \"build/bdist.linux-x86_64/egg/netmiko/ssh_dispatcher.py\", line 178, in ConnectHandler\n File \"build/bdist.linux-x86_64/egg/netmiko/base_connection.py\", line 207, in __init__\n File \"build/bdist.linux-x86_64/egg/netmiko/base_connection.py\", line 693, in establish_connection\nnetmiko.ssh_exception.NetMikoTimeoutException: Connection to device timed-out: cisco_ios VTgroup_SW:22\n", "module_stdout": "", "msg": "MODULE FAILURE", "rc": 1}
Here is a working solution using ntc_show_command to a Cisco IOS device.
- name: Cisco IOS Automation
  hosts: pynet-rtr1
  connection: local
  gather_facts: no
  tasks:
    - name: GET UPTIME
      ntc_show_command:
        connection: ssh
        platform: "cisco_ios"
        command: 'show version'
        host: "{{ ansible_host }}"
        username: "{{ ansible_user }}"
        password: "{{ ansible_ssh_pass }}"
        use_templates: True
        template_dir: '/home/kbyers/ntc-templates/templates'
If you are going to use ntc-templates, I probably would not have the '| include uptime' in the 'show version'. In other words, let TextFSM convert the output to structured data first and then grab the uptime from that structured data.
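For example, something along these lines registers the parsed output and pulls the uptime out of it (a sketch only: it assumes the show version template exposes an uptime field and that the module returns the parsed rows under response, so verify both against your templates and the module docs):
- name: GET UPTIME (parsed by TextFSM)
  ntc_show_command:
    connection: ssh
    platform: "cisco_ios"
    command: 'show version'
    host: "{{ ansible_host }}"
    username: "{{ ansible_user }}"
    password: "{{ ansible_ssh_pass }}"
    use_templates: True
    template_dir: '/home/kbyers/ntc-templates/templates'
  register: version_output
- name: Print just the uptime from the structured data
  debug:
    msg: "{{ version_output.response[0].uptime }}"  # field name comes from the TextFSM template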
I modified inventory_hostname to ansible_host to be consistent with my inventory format (my inventory_hostname doesn't actually resolve in DNS).
I modified username and password to 'ansible_user' and 'ansible_ssh_pass' to be consistent with my inventory and also to be more consistent with Ansible 2.5/2.6 variable naming.
On your above issue, your exception message does not match your playbook (i.e. are you sure that is the exception you get for that playbook?).
Here is my inventory file (I simplified this to remove some unnecessary devices and to hide confidential information)
[all:vars]
ansible_connection=local
ansible_python_interpreter=/home/kbyers/VENV/ansible/bin/python
ansible_user=user
ansible_ssh_pass=password
[local]
localhost ansible_connection=local
[cisco]
pynet-rtr1 ansible_host=cisco1.domain.com
pynet-rtr2 ansible_host=cisco2.domain.com

openwhisk postdeploy fails on single node ubuntu virtual machine

I am trying to run the OpenWhisk serverless framework on a single-node Ubuntu VM.
I am following the instructions here.
I followed the instructions for database setup and then went over to the steps listed for the Ansible single-node deployment (ansible/README.md).
Using the steps under "Deploy Using CouchDB", in the following step:
ansible-playbook -i environments/<environment> postdeploy.yml
I get an error when running installCatalog.sh.
It looks like the URL 172.17.0.1 is not accessible. Where am I going wrong?
TASK [install the catalog from the catalog location] ***************************
Thursday 04 May 2017 10:41:29 +0000 (0:00:01.602) 0:00:09.063 **********
fatal: [ansible]: FAILED! => {"changed": true, "cmd": "./installCatalog.sh /home/techie/openwhisk/ansible/../ansible/files/auth.whisk.system 172.17.0.1 /whisk.system /home/techie/openwhisk/ansible/../bin/wsk", "delta": "0:00:01.840405", "end": "2017-05-04 10:41:32.380241", "failed": true, "rc": 7, "start": "2017-05-04 10:41:30.539836", "stderr": "error: Package update failed: Put 172.17.0.1/api/v1/namespaces/_/packages/websocket?overwrite=true: dial tcp 172.17.0.1:443: getsockopt: connection refused\nerror: Package update failed: Put 172.17.0.1/api/v1/namespaces/_/packages/combinators?overwrite=true: dial tcp 172.17.0.1:443: getsockopt: connection refused\nerror: Package update failed: Put 172.17.0.1/api/v1/namespaces/_/packages/watson-speechToText?overwrite=true: dial tcp 172.17.0.1:443: getsockopt: connection refused\nerror: Package update failed: Put 172.17.0.1/api/v1/namespaces/_/packages/utils?overwrite=true: dial tcp 172.17.0.1:443: getsockopt: connection refused\nerror: Package update failed:
.......
I ran docker ps after the deployment step. There were several containers running, like zookeeper, kafka, etc. Is there supposed to be an nginx container running too? In my set-up there was no nginx container running.
In the config files, I have base url set to 172.17.0.1 - is this ok, or could it be something else?
I found that I needed to also run edge.yml after apigateway.yml and before postdeploy.yml to get the postdeploy script to work and then to be able to have the wsk tool work against the API endpoint.
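For reference, that ordering looks roughly like this, with the same <environment> placeholder as in the README (edge.yml is, as far as I recall, the play that brings up the nginx edge container, which would also explain why nginx was missing from docker ps):
ansible-playbook -i environments/<environment> apigateway.yml
ansible-playbook -i environments/<environment> edge.yml
ansible-playbook -i environments/<environment> postdeploy.yml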
