Advanced AWS CloudFormation - cfn-init & cfn-hup not working - WordPress

I am experimenting with CloudFormation cfn-init and cfn-hup based on the template below, but the WordPress stack doesn't get created. The cfn-hup process is not started and cfn-init exits with code 1. Please see the stack template and error log details below. Can anyone help me understand what's going wrong here?
Stack template:
Parameters:
  DecideEnvSize:
    Type: String
    Default: LOW
    AllowedValues:
      - LOW
      - MEDIUM
      - HIGH
    Description: Select Environment Size (S,M,L)
  DatabaseName:
    Type: String
    Default: DB4wordpress
  DatabaseUser:
    Type: String
    Default: ***************
  DatabasePassword:
    Type: String
    Default: *************
    NoEcho: true
  TestString:
    Type: String
    Default: Don't eat yourself up!!!
Mappings:
  MyRegionMap:
    us-east-1:
      "AMALINUX": "ami-c481fad3" # AMALINUX SEP 2016 - N. Virginia
    us-east-2:
      "AMALINUX": "ami-71ca9114" # AMALINUX SEP 2016 - Ohio
  InstanceSize:
    LOW:
      "EC2": "t2.micro"
      "DB": "db.t2.micro"
    MEDIUM:
      "EC2": "t2.small"
      "DB": "db.t2.small"
    HIGH:
      "EC2": "t2.medium"
      "DB": "db.t2.medium"
Resources:
  DBServer:
    Type: "AWS::RDS::DBInstance"
    Properties:
      AllocatedStorage: 5
      StorageType: gp2
      DBInstanceClass: !FindInMap [InstanceSize, !Ref DecideEnvSize, DB] # Dynamic mapping + Pseudo Parameter
      DBName: !Ref DatabaseName
      Engine: MySQL
      MasterUsername: !Ref DatabaseUser
      MasterUserPassword: !Ref DatabasePassword
    DeletionPolicy: Delete
  EC2server:
    Type: "AWS::EC2::Instance"
    DependsOn: DBServer
    Properties:
      ImageId: !FindInMap [MyRegionMap, !Ref "AWS::Region", AMALINUX] # Dynamic mapping + Pseudo Parameter
      InstanceType: !FindInMap [InstanceSize, !Ref DecideEnvSize, EC2]
      KeyName: AdvancedCFN
      UserData:
        "Fn::Base64":
          !Sub |
            #!/bin/bash
            yum update -y aws-cfn-bootstrap # good practice - always do this.
            /opt/aws/bin/cfn-init -v --stack ${AWS::StackName} --resource EC2server --configsets wordpress --region ${AWS::Region}
            yum -y update
    Metadata:
      AWS::CloudFormation::Init:
        configSets:
          wordpress:
            - "configure_cfn"
            - "install_wordpress"
            - "config_wordpress"
        configure_cfn:
          files:
            /etc/cfn/cfn-hup.conf:
              content: !Sub |
                [main-just some name]
                stack=${AWS::StackId}
                region=${AWS::Region}
                verbose=true
                interval=5
              mode: "000400"
              owner: root
              group: root
            /etc/cfn/hooks.d/cfn-auto-reloader.conf:
              content: !Sub |
                [cfn-auto-reloader-hook #just a name]
                triggers=post.update
                path=Resources.EC2server.Metadata.AWS::CloudFormation::Init
                action=/opt/aws/bin/cfn-init -v --stack ${AWS::StackName} --resource EC2server --configsets wordpress --region ${AWS::Region}
              mode: "000400"
              owner: root
              group: root
            /var/www/html/index2.html:
              content: !Ref TestString
          services:
            sysvinit:
              cfn-hup:
                enabled: "true"
                ensureRunning: "true"
                files:
                  - "/etc/cfn/cfn-hup.conf"
                  - "/etc/cfn/hooks.d/cfn-auto-reloader.conf"
        install_wordpress:
          packages:
            yum:
              httpd: []
              php: []
              mysql: []
              php-mysql: []
          sources:
            /var/www/html: "http://wordpress.org/latest.tar.gz"
          services:
            sysvinit:
              httpd:
                enabled: "true"
                ensureRunning: "true"
        config_wordpress:
          commands:
            01_clone_config:
              cwd: "/var/www/html/wordpress"
              test: "test ! -e /var/www/html/wordpress/wp-config.php"
              command: "cp wp-config-sample.php wp-config.php"
            02_inject_dbhost:
              cwd: "/var/www/html/wordpress"
              command: !Sub |
                sed -i 's/localhost/${DBServer.Endpoint.Address}/g' wp-config.php
            03_inject_dbname:
              cwd: "/var/www/html/wordpress"
              command: !Sub |
                sed -i 's/database_name_here/${DatabaseName}/g' wp-config.php
            04_inject_dbuser:
              cwd: "/var/www/html/wordpress"
              command: !Sub |
                sed -i 's/username_here/${DatabaseUser}/g' wp-config.php
            05_inject_dbpassword:
              cwd: "/var/www/html/wordpress"
              command: !Sub |
                sed -i 's/password_here/${DatabasePassword}/g' wp-config.php
  S3blob:
    Type: "AWS::S3::Bucket"
Error & log details
[root@ip-172-31-25-239 ec2-user]# cd /var/log
[root@ip-172-31-25-239 log]# ls
audit  btmp  cfn-init-cmd.log  cfn-wire.log  cloud-init-output.log  dmesg  lastlog  maillog  ntpstats  spooler  wtmp
boot.log  cfn-hup.log  cfn-init.log  cloud-init.log  cron  dracut.log  mail  messages  secure  tallylog  yum.log
[root@ip-172-31-25-239 log]# cat cfn-hup.log
2017-12-30 10:48:15,923 [ERROR] Error: [main] section must contain stack option
===========================================================================================
===========================================================================================
[root@ip-172-31-25-239 log]# cat cfn-init.log
2017-12-30 10:48:15,499 [DEBUG] CloudFormation client initialized with endpoint https://cloudformation.us-east-1.amazonaws.com
2017-12-30 10:48:15,501 [DEBUG] Describing resource EC2server in stack Yetagain-init-hup-try10
2017-12-30 10:48:15,616 [INFO] -----------------------Starting build-----------------------
2017-12-30 10:48:15,616 [DEBUG] Not setting a reboot trigger as scheduling support is not available
2017-12-30 10:48:15,617 [INFO] Running configSets: wordpress
2017-12-30 10:48:15,618 [INFO] Running configSet wordpress
2017-12-30 10:48:15,619 [INFO] Running config configure_cfn
2017-12-30 10:48:15,620 [DEBUG] No packages specified
2017-12-30 10:48:15,620 [DEBUG] No groups specified
2017-12-30 10:48:15,620 [DEBUG] No users specified
2017-12-30 10:48:15,620 [DEBUG] No sources specified
2017-12-30 10:48:15,620 [DEBUG] Parent directory /etc/cfn does not exist, creating
2017-12-30 10:48:15,625 [DEBUG] Writing content to /etc/cfn/cfn-hup.conf
2017-12-30 10:48:15,625 [DEBUG] Setting mode for /etc/cfn/cfn-hup.conf to 000400
2017-12-30 10:48:15,626 [DEBUG] Setting owner 0 and group 0 for /etc/cfn/cfn-hup.conf
2017-12-30 10:48:15,626 [DEBUG] Parent directory /etc/cfn/hooks.d does not exist, creating
2017-12-30 10:48:15,626 [DEBUG] Writing content to /etc/cfn/hooks.d/cfn-auto-reloader.conf
2017-12-30 10:48:15,626 [DEBUG] Setting mode for /etc/cfn/hooks.d/cfn-auto-reloader.conf to 000400
2017-12-30 10:48:15,626 [DEBUG] Setting owner 0 and group 0 for /etc/cfn/hooks.d/cfn-auto-reloader.conf
2017-12-30 10:48:15,626 [DEBUG] Parent directory /var/www/html does not exist, creating
2017-12-30 10:48:15,627 [DEBUG] Writing content to /var/www/html/index2.html
2017-12-30 10:48:15,627 [DEBUG] No mode specified for /var/www/html/index2.html. The file will be created with the mode: 0644
2017-12-30 10:48:15,627 [DEBUG] No commands specified
2017-12-30 10:48:15,627 [DEBUG] Using service modifier: /sbin/chkconfig
2017-12-30 10:48:15,627 [DEBUG] Setting service cfn-hup to enabled
2017-12-30 10:48:15,634 [INFO] enabled service cfn-hup
2017-12-30 10:48:15,635 [DEBUG] Restarting cfn-hup due to change detected in dependency
2017-12-30 10:48:15,635 [DEBUG] Using service runner: /sbin/service
2017-12-30 10:48:15,941 [ERROR] Could not restart service cfn-hup; return code was 1
2017-12-30 10:48:15,941 [DEBUG] Service output: Stopping cfn-hup: [FAILED]
Starting cfn-hup: [FAILED]
2017-12-30 10:48:15,942 [ERROR] Error encountered during build of configure_cfn: Could not restart cfn-hup
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/cfnbootstrap/construction.py", line 542, in run_config
CloudFormationCarpenter(config, self._auth_config).build(worklog)
File "/usr/lib/python2.7/dist-packages/cfnbootstrap/construction.py", line 270, in build
CloudFormationCarpenter._serviceTools[manager]().apply(services, changes)
File "/usr/lib/python2.7/dist-packages/cfnbootstrap/service_tools.py", line 161, in apply
self._restart_service(service)
File "/usr/lib/python2.7/dist-packages/cfnbootstrap/service_tools.py", line 185, in _restart_service
raise ToolError("Could not restart %s" % service)
ToolError: Could not restart cfn-hup
2017-12-30 10:48:15,942 [ERROR] -----------------------BUILD FAILED!------------------------
2017-12-30 10:48:15,944 [ERROR] Unhandled exception during build: Could not restart cfn-hup
Traceback (most recent call last):
File "/opt/aws/bin/cfn-init", line 171, in <module>
worklog.build(metadata, configSets)
File "/usr/lib/python2.7/dist-packages/cfnbootstrap/construction.py", line 129, in build
Contractor(metadata).build(configSets, self)
File "/usr/lib/python2.7/dist-packages/cfnbootstrap/construction.py", line 530, in build
self.run_config(config, worklog)
File "/usr/lib/python2.7/dist-packages/cfnbootstrap/construction.py", line 542, in run_config
CloudFormationCarpenter(config, self._auth_config).build(worklog)
File "/usr/lib/python2.7/dist-packages/cfnbootstrap/construction.py", line 270, in build
CloudFormationCarpenter._serviceTools[manager]().apply(services, changes)
File "/usr/lib/python2.7/dist-packages/cfnbootstrap/service_tools.py", line 161, in apply
self._restart_service(service)
File "/usr/lib/python2.7/dist-packages/cfnbootstrap/service_tools.py", line 185, in _restart_service
raise ToolError("Could not restart %s" % service)
ToolError: Could not restart cfn-hup
=============================================================================================================
==============================================================================================================
[root@ip-172-31-25-239 log]# cat /etc/cfn/cfn-hup.conf
[main-just some name]
stack=arn:aws:cloudformation:us-east-1:523324464109:stack/Yetagain-init-hup-try10/908305e0-ed4d-11e7-b9f7-500c285ebefd
region=us-east-1
verbose=true
interval=5
==========================================================================================================
=========================================================================================================
[root@ip-172-31-25-239 log]# cat /etc/cfn/hooks.d/cfn-auto-reloader.conf
[cfn-auto-reloader-hook #just a name]
triggers=post.update
path=Resources.EC2server.Metadata.AWS::CloudFormation::Init
action=/opt/aws/bin/cfn-init -v --stack Yetagain-init-hup-try10 --resource EC2server --configsets wordpress --region us-east-1

Looks like the main error is in cfn-hup.log:
2017-12-30 10:48:15,923 [ERROR] Error: [main] section must contain stack option
Try changing [main-just some name] to [main] in your cfn-hup.conf. For reference, my /etc/cfn/cfn-hup.conf looks something like this:
[main]
stack=arn:aws:cloudformation:us-west-1:acccount_id:stack/mystack-dev-ecs-EC2-1VF68LZMOLAIY/cb2a6a80-554a-11e8-b318-503dcab41efa
region=us-west-1
interval=5
verbose=true
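In the template itself that means fixing the section header inside the configure_cfn files block. A sketch of the corrected entry, keeping the settings the question already uses (only the [main] header changes):

configure_cfn:
  files:
    /etc/cfn/cfn-hup.conf:
      content: !Sub |
        [main]
        stack=${AWS::StackId}
        region=${AWS::Region}
        verbose=true
        interval=5
      mode: "000400"
      owner: root
      group: root

The hook file's section header is only a hook name, so [cfn-auto-reloader-hook #just a name] should still load, but renaming it to a plain [cfn-auto-reloader-hook] keeps the INI files unambiguous.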

Related

When I run my playbook it shows "VARIABLE IS NOT DEFINED!"

- hosts: switch
  connection: network_cli
  become_method: enable
  gather_facts: no
  vars_prompt:
    - name: vlan_id
      prompt: enter the vlan_id
      private: no
  vars:
    cli:
      username: admin
      password: int123$%^
    vlans:
      100: "CORE"
      200: "MONITORING"
      300: "ACCESS"
      400: "GUEST_WIFI"
    ansible_buffer_read_timeout: 2
  tasks:
    - name: "creating the vlans"
      ios_vlans:
        config:
          - vlan_id: "{{ vlan_id }}"
            mtu: 700
            state: active
            shutdown: disabled
      register: show_vlan
    - debug:
        var: show_vlan.stdout_lines
Output:
enter the vlan_id: 11
PLAY [switch] ****************************************************************************************************************************************************
TASK [creating the vlans] ****************************************************************************************************************************************
changed: [172.16.1.252]
TASK [debug] *****************************************************************************************************************************************************
ok: [172.16.1.252] => show_vlan.stdout_lines: VARIABLE IS NOT DEFINED!
PLAY RECAP *******************************************************************************************************************************************************
172.16.1.252 : ok=2 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
The ios_vlans module does not have a stdout_lines key in its return values. Please check the documentation here.
So debug show_vlan instead:
- debug:
    var: show_vlan
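If the goal is to inspect what was actually pushed to the switch, a small sketch like the following guards on the commands key that ios_vlans (like other resource modules) usually returns; treat the exact key name as an assumption to verify against the module docs:

- name: show what ios_vlans changed
  debug:
    var: show_vlan.commands
  when: show_vlan.commands is defined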

Stuck with a partial Helm release on Terraform to Kubernetes

I'm trying to apply a Terraform resource (helm_release) to k8s and the apply command failed halfway through.
I checked the pod issue; now I need to update some values in the local chart.
Now I'm in a dilemma: I can't apply the helm_release because the name is already in use, and I can't destroy the helm_release since it was never created.
It seems to me the only option is to manually delete the k8s resources that were created by the helm_release chart?
Here is the terraform for helm_release:
cat nginx-arm64.tf
resource "helm_release" "nginx-ingress" {
  name  = "nginx-ingress"
  chart = "/data/terraform/k8s/nginx-ingress-controller-arm64.tgz"
}
BTW: I need to use the local chart as the official chart does not support the ARM64 architecture.
Thanks,
Edit #1:
Here is the list of Helm releases; there is no nginx ingress:
/data/terraform/k8s$ helm list -A
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
cert-manager default 1 2021-12-08 20:57:38.979176622 +0000 UTC deployed cert-manager-v1.5.0 v1.5.0
/data/terraform/k8s$
Here is the describe pod output:
$ k describe pod/nginx-ingress-nginx-ingress-controller-99cddc76b-62nsr
Name: nginx-ingress-nginx-ingress-controller-99cddc76b-62nsr
Namespace: default
Priority: 0
Node: ocifreevmalways/10.0.0.189
Start Time: Wed, 08 Dec 2021 11:11:59 +0000
Labels: app.kubernetes.io/component=controller
app.kubernetes.io/instance=nginx-ingress
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=nginx-ingress-controller
helm.sh/chart=nginx-ingress-controller-9.0.9
pod-template-hash=99cddc76b
Annotations: <none>
Status: Running
IP: 10.244.0.22
IPs:
IP: 10.244.0.22
Controlled By: ReplicaSet/nginx-ingress-nginx-ingress-controller-99cddc76b
Containers:
controller:
Container ID: docker://0b75f5f68ef35dfb7dc5b90f9d1c249fad692855159f4e969324fc4e2ee61654
Image: docker.io/rancher/nginx-ingress-controller:nginx-1.1.0-rancher1
Image ID: docker-pullable://rancher/nginx-ingress-controller@sha256:177fb5dc79adcd16cb6c15d6c42cef31988b116cb148845893b6b954d7d593bc
Ports: 80/TCP, 443/TCP
Host Ports: 0/TCP, 0/TCP
Args:
/nginx-ingress-controller
--default-backend-service=default/nginx-ingress-nginx-ingress-controller-default-backend
--election-id=ingress-controller-leader
--controller-class=k8s.io/ingress-nginx
--configmap=default/nginx-ingress-nginx-ingress-controller
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 255
Started: Wed, 08 Dec 2021 22:02:15 +0000
Finished: Wed, 08 Dec 2021 22:02:15 +0000
Ready: False
Restart Count: 132
Liveness: http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
Environment:
POD_NAME: nginx-ingress-nginx-ingress-controller-99cddc76b-62nsr (v1:metadata.name)
POD_NAMESPACE: default (v1:metadata.namespace)
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-wzqqn (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-wzqqn:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulled 8m38s (x132 over 10h) kubelet Container image "docker.io/rancher/nginx-ingress-controller:nginx-1.1.0-rancher1" already present on machine
Warning BackOff 3m39s (x3201 over 10h) kubelet Back-off restarting failed container
The terraform state list shows nothing:
/data/terraform/k8s$ t state list
/data/terraform/k8s$
Though terraform.tfstate.backup shows the nginx ingress (I guess I ran the destroy command in between?):
/data/terraform/k8s$ cat terraform.tfstate.backup
{
"version": 4,
"terraform_version": "1.0.11",
"serial": 28,
"lineage": "30e74aa5-9631-f82f-61a2-7bdbd97c2276",
"outputs": {},
"resources": [
{
"mode": "managed",
"type": "helm_release",
"name": "nginx-ingress",
"provider": "provider[\"registry.terraform.io/hashicorp/helm\"]",
"instances": [
{
"status": "tainted",
"schema_version": 0,
"attributes": {
"atomic": false,
"chart": "/data/terraform/k8s/nginx-ingress-controller-arm64.tgz",
"cleanup_on_fail": false,
"create_namespace": false,
"dependency_update": false,
"description": null,
"devel": null,
"disable_crd_hooks": false,
"disable_openapi_validation": false,
"disable_webhooks": false,
"force_update": false,
"id": "nginx-ingress",
"keyring": null,
"lint": false,
"manifest": null,
"max_history": 0,
"metadata": [
{
"app_version": "1.1.0",
"chart": "nginx-ingress-controller",
"name": "nginx-ingress",
"namespace": "default",
"revision": 1,
"values": "{}",
"version": "9.0.9"
}
],
"name": "nginx-ingress",
"namespace": "default",
"postrender": [],
"recreate_pods": false,
"render_subchart_notes": true,
"replace": false,
"repository": null,
"repository_ca_file": null,
"repository_cert_file": null,
"repository_key_file": null,
"repository_password": null,
"repository_username": null,
"reset_values": false,
"reuse_values": false,
"set": [],
"set_sensitive": [],
"skip_crds": false,
"status": "failed",
"timeout": 300,
"values": null,
"verify": false,
"version": "9.0.9",
"wait": true,
"wait_for_jobs": false
},
"sensitive_attributes": [],
"private": "bnVsbA=="
}
]
}
]
}
When I try to apply in the same directory, it prompts the error again:
Plan: 1 to add, 0 to change, 0 to destroy.
Do you want to perform these actions?
Terraform will perform the actions described above.
Only 'yes' will be accepted to approve.
Enter a value: yes
helm_release.nginx-ingress: Creating...
╷
│ Error: cannot re-use a name that is still in use
│
│ with helm_release.nginx-ingress,
│ on nginx-arm64.tf line 1, in resource "helm_release" "nginx-ingress":
│ 1: resource "helm_release" "nginx-ingress" {
Please share your thoughts. Thanks.
Edit #2:
The DEBUG logs show some more clues:
2021-12-09T04:30:14.118Z [INFO] provider.terraform-provider-helm_v2.4.1_x5: 2021/12/09 04:30:14 [DEBUG] [resourceDiff: nginx-ingress] Release validated: timestamp=2021-12-09T04:30:14.118Z
2021-12-09T04:30:14.118Z [INFO] provider.terraform-provider-helm_v2.4.1_x5: 2021/12/09 04:30:14 [DEBUG] [resourceDiff: nginx-ingress] Done: timestamp=2021-12-09T04:30:14.118Z
2021-12-09T04:30:14.119Z [WARN] Provider "registry.terraform.io/hashicorp/helm" produced an invalid plan for helm_release.nginx-ingress, but we are tolerating it because it is using the legacy plugin SDK.
The following problems may be the cause of any confusing errors from downstream operations:
- .cleanup_on_fail: planned value cty.False for a non-computed attribute
- .create_namespace: planned value cty.False for a non-computed attribute
- .verify: planned value cty.False for a non-computed attribute
- .recreate_pods: planned value cty.False for a non-computed attribute
- .render_subchart_notes: planned value cty.True for a non-computed attribute
- .replace: planned value cty.False for a non-computed attribute
- .reset_values: planned value cty.False for a non-computed attribute
- .disable_crd_hooks: planned value cty.False for a non-computed attribute
- .lint: planned value cty.False for a non-computed attribute
- .namespace: planned value cty.StringVal("default") for a non-computed attribute
- .skip_crds: planned value cty.False for a non-computed attribute
- .disable_webhooks: planned value cty.False for a non-computed attribute
- .force_update: planned value cty.False for a non-computed attribute
- .timeout: planned value cty.NumberIntVal(300) for a non-computed attribute
- .reuse_values: planned value cty.False for a non-computed attribute
- .dependency_update: planned value cty.False for a non-computed attribute
- .disable_openapi_validation: planned value cty.False for a non-computed attribute
- .atomic: planned value cty.False for a non-computed attribute
- .wait: planned value cty.True for a non-computed attribute
- .max_history: planned value cty.NumberIntVal(0) for a non-computed attribute
- .wait_for_jobs: planned value cty.False for a non-computed attribute
helm_release.nginx-ingress: Creating...
2021-12-09T04:30:14.119Z [INFO] Starting apply for helm_release.nginx-ingress
2021-12-09T04:30:14.119Z [INFO] Starting apply for helm_release.nginx-ingress
2021-12-09T04:30:14.119Z [DEBUG] helm_release.nginx-ingress: applying the planned Create change
2021-12-09T04:30:14.120Z [INFO] provider.terraform-provider-helm_v2.4.1_x5: 2021/12/09 04:30:14 [DEBUG] setting computed for "metadata" from ComputedKeys: timestamp=2021-12-09T04:30:14.120Z
2021-12-09T04:30:14.120Z [INFO] provider.terraform-provider-helm_v2.4.1_x5: 2021/12/09 04:30:14 [DEBUG] [resourceReleaseCreate: nginx-ingress] Started: timestamp=2021-12-09T04:30:14.120Z
2021-12-09T04:30:14.120Z [INFO] provider.terraform-provider-helm_v2.4.1_x5: 2021/12/09 04:30:14 [DEBUG] [resourceReleaseCreate: nginx-ingress] Getting helm configuration: timestamp=2021-12-09T04:30:14.120Z
2021-12-09T04:30:14.120Z [INFO] provider.terraform-provider-helm_v2.4.1_x5: 2021/12/09 04:30:14 [DEBUG] [INFO] GetHelmConfiguration start: timestamp=2021-12-09T04:30:14.120Z
2021-12-09T04:30:14.120Z [INFO] provider.terraform-provider-helm_v2.4.1_x5: 2021/12/09 04:30:14 [DEBUG] Using kubeconfig: /home/ubuntu/.kube/config: timestamp=2021-12-09T04:30:14.120Z
2021-12-09T04:30:14.120Z [INFO] provider.terraform-provider-helm_v2.4.1_x5: 2021/12/09 04:30:14 [INFO] Successfully initialized kubernetes config: timestamp=2021-12-09T04:30:14.120Z
2021-12-09T04:30:14.121Z [INFO] provider.terraform-provider-helm_v2.4.1_x5: 2021/12/09 04:30:14 [DEBUG] [INFO] GetHelmConfiguration success: timestamp=2021-12-09T04:30:14.121Z
2021-12-09T04:30:14.121Z [INFO] provider.terraform-provider-helm_v2.4.1_x5: 2021/12/09 04:30:14 [DEBUG] [resourceReleaseCreate: nginx-ingress] Getting chart: timestamp=2021-12-09T04:30:14.121Z
2021-12-09T04:30:14.125Z [INFO] provider.terraform-provider-helm_v2.4.1_x5: 2021/12/09 04:30:14 [DEBUG] [resourceReleaseCreate: nginx-ingress] Preparing for installation: timestamp=2021-12-09T04:30:14.125Z
2021-12-09T04:30:14.125Z [INFO] provider.terraform-provider-helm_v2.4.1_x5: 2021/12/09 04:30:14 ---[ values.yaml ]-----------------------------------
{}: timestamp=2021-12-09T04:30:14.125Z
2021-12-09T04:30:14.125Z [INFO] provider.terraform-provider-helm_v2.4.1_x5: 2021/12/09 04:30:14 [DEBUG] [resourceReleaseCreate: nginx-ingress] Installing chart: timestamp=2021-12-09T04:30:14.125Z
╷
│ Error: cannot re-use a name that is still in use
│
│ with helm_release.nginx-ingress,
│ on nginx-arm64.tf line 1, in resource "helm_release" "nginx-ingress":
│ 1: resource "helm_release" "nginx-ingress" {
│
╵
2021-12-09T04:30:14.158Z [DEBUG] provider.stdio: received EOF, stopping recv loop: err="rpc error: code = Unavailable desc = transport is closing"
2021-12-09T04:30:14.160Z [DEBUG] provider: plugin process exited: path=.terraform/providers/registry.terraform.io/hashicorp/helm/2.4.1/linux_arm64/terraform-provider-helm_v2.4.1_x5 pid=558800
2021-12-09T04:30:14.160Z [DEBUG] provider: plugin exited
You don't have to manually delete all the resources using kubectl. Under the hood the Terraform Helm provider still uses Helm. So if you run helm list -A you will see all the Helm releases on your cluster, including the nginx-ingress release. Deleting the release is then done via helm uninstall nginx-ingress -n REPLACE_WITH_YOUR_NAMESPACE.
Before re-running terraform apply do check if the Helm release is still in your Terraform state via terraform state list (run this from the same directory as where you run terraform apply from). If you don't see helm_release.nginx-ingress in that list then it is not in your Terraform state and you can just rerun your terraform apply. Else you have to delete it via terraform state rm helm_release.nginx-ingress and then you can run terraform apply again.
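A condensed sketch of that recovery sequence, assuming the release name nginx-ingress and the default namespace (adjust both to your setup):

# See whether Helm still knows about the failed release
helm list -A
helm status nginx-ingress -n default

# Free the name by removing the half-installed release
helm uninstall nginx-ingress -n default

# Only remove the resource from Terraform state if it is actually listed
terraform state list
terraform state rm helm_release.nginx-ingress

# Then re-run the apply
terraform apply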
I just faced a similar issue, but in my case there was neither Terraform state for the Helm release nor a Helm release itself, so helm list -A (and helm list in the current namespace) showed nothing.
I found this issue that solved it: helm/helm#4174
With Helm 3, all release metadata is saved as Secrets in the same
namespace as the release. If you get "cannot re-use a name that is
still in use", you may need to check for orphan secrets
and delete them.
After deleting them it started working.
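A sketch of how to find and remove those orphan secrets, assuming the release is called nginx-ingress and lives in the default namespace (the secret name below is an example of Helm 3's sh.helm.release.v1.<name>.v<revision> naming):

# Helm 3 stores release metadata as Secrets labelled owner=helm
kubectl get secrets -A -l owner=helm

# Narrow it down to the release that blocks the name
kubectl get secrets -A -l owner=helm,name=nginx-ingress

# Delete the orphaned secret(s) reported above
kubectl delete secret sh.helm.release.v1.nginx-ingress.v1 -n default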

Apache Airflow: DAG task marked zombie, with background process running on remote server

Apache Airflow version: 1.10.9-composer
Kubernetes Version : Client Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.12-gke.6002", GitCommit:"035184604aff4de66f7db7fddadb8e7be76b6717", GitTreeState:"clean", BuildDate:"2020-12-01T23:13:35Z", GoVersion:"go1.12.17b4", Compiler:"gc", Platform:"linux/amd64"}
Environment: Airflow, running on top of Kubernetes - Linux version 4.19.112
OS: Linux version 4.19.112+ (builder@7fc5cdead624) (Chromium OS 9.0_pre361749_p20190714-r4 clang version 9.0.0 (/var/cache/chromeos-cache/distfiles/host/egit-src/llvm-project c11de5eada2decd0a495ea02676b6f4838cd54fb) (based on LLVM 9.0.0svn)) #1 SMP Fri Sep 4 12:00:04 PDT 2020
Kernel : Linux gke-europe-west2-asset-c-default-pool-dc35e2f2-0vgz
4.19.112+ #1 SMP Fri Sep 4 12:00:04 PDT 2020 x86_64 Intel(R) Xeon(R) CPU @ 2.20GHz GenuineIntel GNU/Linux
What happened?
A running task is marked as a zombie after the execution time crosses the latest heartbeat + 5 minutes.
The task runs in the background on another application server, triggered using SSHOperator.
[2021-01-18 11:53:37,491] {taskinstance.py:888} INFO - Executing <Task(SSHOperator): load_trds_option_composite_file> on 2021-01-17T11:40:00+00:00
[2021-01-18 11:53:37,495] {base_task_runner.py:131} INFO - Running on host: airflow-worker-6f6fd78665-lm98m
[2021-01-18 11:53:37,495] {base_task_runner.py:132} INFO - Running: ['airflow', 'run', 'dsp_etrade_process_trds_option_composite_0530', 'load_trds_option_composite_file', '2021-01-17T11:40:00+00:00', '--job_id', '282759', '--pool', 'default_pool', '--raw', '-sd', 'DAGS_FOLDER/dsp_etrade_trds_option_composite_0530.py', '--cfg_path', '/tmp/tmpge4_nva0']
Task execution time:
dag_id dsp_etrade_process_trds_option_composite_0530
duration 7270.47
start_date 2021-01-18 11:53:37,491
end_date 2021-01-18 13:54:47.799728+00:00
Scheduler Logs during that time:
[2021-01-18 13:54:54,432] {taskinstance.py:1135} ERROR - <TaskInstance: dsp_etrade_process_etrd.push_run_date 2021-01-18 13:30:00+00:00 [running]> detected as zombie
{
textPayload: "[2021-01-18 13:54:54,432] {taskinstance.py:1135} ERROR - <TaskInstance: dsp_etrade_process_etrd.push_run_date 2021-01-18 13:30:00+00:00 [running]> detected as zombie"
insertId: "1ca8zyfg3zvma66"
resource: {
type: "cloud_composer_environment"
labels: {3}
}
timestamp: "2021-01-18T13:54:54.432862699Z"
severity: "ERROR"
logName: "projects/asset-control-composer-prod/logs/airflow-scheduler"
receiveTimestamp: "2021-01-18T13:54:55.714437665Z"
}
Airflow-webserver log :
X.X.X.X - - [18/Jan/2021:13:54:39 +0000] "GET /_ah/health HTTP/1.1" 200 187 "-" "GoogleHC/1.0"
{
textPayload: "172.17.0.5 - - [18/Jan/2021:13:54:39 +0000] "GET /_ah/health HTTP/1.1" 200 187 "-" "GoogleHC/1.0"
"
insertId: "1sne0gqg43o95n3"
resource: {2}
timestamp: "2021-01-18T13:54:45.401670481Z"
logName: "projects/asset-control-composer-prod/logs/airflow-webserver"
receiveTimestamp: "2021-01-18T13:54:50.598807514Z"
}
Airflow Info logs :
2021-01-18 08:54:47.799 EST
{
textPayload: "NoneType: None
"
insertId: "1ne3hqgg47yzrpf"
resource: {2}
timestamp: "2021-01-18T13:54:47.799661030Z"
severity: "INFO"
logName: "projects/asset-control-composer-prod/logs/airflow-scheduler"
receiveTimestamp: "2021-01-18T13:54:50.914461159Z"
}
[2021-01-18 13:54:47,800] {taskinstance.py:1192} INFO - Marking task as FAILED.dag_id=dsp_etrade_process_trds_option_composite_0530, task_id=load_trds_option_composite_file, execution_date=20210117T114000, start_date=20210118T115337, end_date=20210118T135447
{
textPayload: "[2021-01-18 13:54:47,800] {taskinstance.py:1192} INFO - Marking task as FAILED.dag_id=dsp_etrade_process_trds_option_composite_0530, task_id=load_trds_option_composite_file, execution_date=20210117T114000, start_date=20210118T115337, end_date=20210118T135447"
insertId: "1ne3hqgg47yzrpg"
resource: {2}
timestamp: "2021-01-18T13:54:47.800605248Z"
severity: "INFO"
logName: "projects/asset-control-composer-prod/logs/airflow-scheduler"
receiveTimestamp: "2021-01-18T13:54:50.914461159Z"
}
Airflow Database shows the latest heartbeat as:
select state, latest_heartbeat from job where id=282759
--------------------------------------
state | latest_heartbeat
running | 2021-01-18 13:48:41.891934
Airflow Configurations:
[celery]
worker_concurrency=6

[scheduler]
scheduler_health_check_threshold=60
scheduler_zombie_task_threshold=300
max_threads=2

[core]
dag_concurrency=6
Kubernetes Cluster :
Worker nodes : 6
What was expected to happen?
The backend process takes around 2 hours 30 minutes to finish. During
such long-running jobs the task is detected as a zombie, even though the
worker node is still processing the task and the state of the job is
still marked as 'running'. The state of the task is not known during the
run time.

How to check if an encrypted variable is decrypted?

I have an Ansible encrypted variable. Now I'd like to be able to run my playbook even when I don't unlock the variable (with --ask-vault-pass) and just skip the tasks that depend on it. Ideally with a warning saying that the task was skipped.
Now when I run my playbook without --ask-vault-pass, it fails with an error:
fatal: [...]: FAILED! => {"changed": false, "msg": "AnsibleError: An unhandled exception occurred while templating '{{ (samba_passwords | string | from_yaml)[samba_username] }}'. Error was a <class 'ansible.parsing.vault.AnsibleVaultError'>, original message: Attempting to decrypt but no vault secrets found"}
Is there a way to check in the when: clause that an encrypted variable is not decrypted and thus inaccessible?
Q: "Check if an encrypted variable is decrypted. Skip the tasks that depend on it. Ideally with a warning saying that the task was skipped."
A: For example, given the file with the variable
shell> cat vars-test.yml
test_var1: test var1
Encrypt the file
shell> ansible-vault encrypt vars-test.yml
New Vault password:
Confirm New Vault password:
Encryption successful
shell> cat vars-test.yml
$ANSIBLE_VAULT;1.1;AES256
61373230346437306135303463393166323063656561623863306333313837666561653466393835
3738666532303836376139613766343930346263633032330a323336643061373039613330653237
30666364376266396633613162626536383161306262613062373239343232663935376364383431
6335623366613834360a336531656537626662376166323766376433653232633139383636613963
64356632633863353534323636313231633866613635343962383463636565303032
Then the playbook
shell> cat pb.yml
- hosts: test_01
  tasks:
    - include_vars: vars-test.yml
      ignore_errors: true
    - set_fact:
        test_var1: "{{ test_var1|default('default') }}"
    - name: Execute tasks if test_var1 was decrypted
      block:
        - debug:
            msg: Execute task1
        - debug:
            msg: Execute task2
      when: test_var1 != 'default'
gives (abridged)
shell> ansible-playbook pb.yml --ask-vault-pass
TASK [include_vars] ****
ok: [test_01]
TASK [set_fact] ****
ok: [test_01]
TASK [debug] ****
ok: [test_01] =>
msg: Execute task1
TASK [debug] ****
ok: [test_01] =>
msg: Execute task2
If you don't provide the command with the password, the playbook gives (abridged)
shell> ansible-playbook pb.yml
PLAY [test_01] ****
TASK [include_vars] ****
fatal: [test_01]: FAILED! => changed=false
ansible_facts: {}
ansible_included_var_files: []
message: Attempting to decrypt but no vault secrets found
...ignoring
TASK [set_fact] ****
ok: [test_01]
TASK [debug] ****
skipping: [test_01]
TASK [debug] ****
skipping: [test_01]
I've researched but I haven't found any way to do that. The easy way to solve this case would be to use ignore_errors: yes in the task.
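A minimal sketch of that ignore_errors approach, reusing the samba_passwords variable from the error above; the secrets.yml file name and the task body are hypothetical:

- include_vars: secrets.yml   # vault-encrypted file defining samba_passwords
  ignore_errors: true

- name: Task that needs the vaulted value (skipped when the vault was not unlocked)
  debug:
    msg: "would configure samba for {{ samba_username }}"
  when: samba_passwords is defined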

Cause of KeyError: '_errors' when running salt state.highstate

After adding the following to my minion pillar file:
monit:
  services:
    - name: elasticsearch
      pid: /var/run/elasticsearch/elasticsearch.pid
      start_script: /etc/init.d/elasticsearch start
      start_script: /etc/init.d/elasticsearch stop
      port: 9200
I started receiving the following error when I tried to run highstate:
root@salt-master:/home/me# salt 'my-minion-id' state.highstate -t 300
my-minion-id:
The minion function caused an exception: Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/salt/minion.py", line 1482, in _thread_return
return_data = executor.execute()
File "/usr/lib/python2.7/dist-packages/salt/executors/direct_call.py", line 28, in execute
return self.func(*self.args, **self.kwargs)
File "/usr/lib/python2.7/dist-packages/salt/modules/state.py", line 848, in highstate
err += __pillar__['_errors']
File "/usr/lib/python2.7/dist-packages/salt/utils/context.py", line 211, in __getitem__
return self._dict()[key]
KeyError: '_errors'
This error was quite annoying, but it turns out that it was because my pillar file contained a dictionary with duplicate keys:
monit:
  services:
    - name: elasticsearch
      ...
      start_script: /etc/init.d/elasticsearch start
      start_script: /etc/init.d/elasticsearch stop
      ...
which should have instead been:
monit:
  services:
    - name: elasticsearch
      ...
      start_script: /etc/init.d/elasticsearch start
      stop_script: /etc/init.d/elasticsearch stop
      ...
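To confirm the fix before re-running highstate, refreshing and rendering the pillar on the minion should now succeed without an _errors entry; a quick check could look like this (using the minion id from above):

salt 'my-minion-id' saltutil.refresh_pillar
salt 'my-minion-id' pillar.items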
Hope this helps save someone time!
