"Target group not attached to the Auto Scaling group" error while doing a blue/green deployment through CodeDeploy - aws-code-deploy

I have been trying a blue/green deployment with CodeDeploy, but it throws this error: The following validation error occurred: Target group not attached to the Auto Scaling group (Service: AmazonAutoScaling; Status Code: 400; Error Code: ValidationError; Request ID: cd58091b-fe83-4dcf-b090-18c3b3d2dbbc; Proxy: null)
This happens even though the policy already grants the permissions needed to work with target groups:
codedeploy:GetDeployment
elasticloadbalancing:DescribeTargetGroups
autoscaling:AttachLoadBalancers
autoscaling:AttachLoadBalancerTargetGroups
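For reference, this is roughly how I check whether the target group is actually attached to the Auto Scaling group, as a minimal boto3 sketch; the region and ASG name below are placeholders:
import boto3

# List the target group ARNs currently attached to the ASG.
autoscaling = boto3.client("autoscaling", region_name="us-east-1")  # placeholder region
response = autoscaling.describe_auto_scaling_groups(
    AutoScalingGroupNames=["my-blue-asg"]  # placeholder ASG name
)
for group in response["AutoScalingGroups"]:
    print(group["AutoScalingGroupName"], group.get("TargetGroupARNs", []))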
Could anyone help me sort out this issue? What am I missing?

In our case, we managed to fix it the hard way by contacting the AWS support team. Briefly about our app: we run a Magento application behind an Application Load Balancer with autoscaling, and the deployment is managed using AWS CodeDeploy with blue/green deployments.
We spent several days figuring out what was going on. Others suggested there might be an issue with IAM permissions, but we hadn't touched those for months and deployments had never had any issues.
The AWS rep replied that, in our case, there is a known issue/limitation in AWS CodeDeploy: it currently does not support blue/green deployments based on ASGs that use target tracking scaling policies, because it does not attach the green ASG to the original target group, and this is a requirement when target tracking scaling policies are enabled on the Auto Scaling group.
We then realized we had made a minor change to our Auto Scaling groups' dynamic scaling policies: we had switched from "CPU utilization"-based metrics to "Request count". Reverting to CPU utilization-based metrics solved the issue and we could run the deployment successfully.
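If you want to confirm whether this limitation applies to your setup, a minimal boto3 sketch like the following lists the scaling policy types attached to the group (the region and ASG name are placeholders):
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")  # placeholder region
response = autoscaling.describe_policies(
    AutoScalingGroupName="my-blue-asg"  # placeholder ASG name
)
for policy in response["ScalingPolicies"]:
    # In our case the deployment failed while any policy here had
    # PolicyType == "TargetTrackingScaling"; after switching back to a
    # CPU utilization-based policy it succeeded.
    print(policy["PolicyName"], policy["PolicyType"])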
Hope this helps, as this error does not seem to be documented in the AWS docs.

Related

How to improve Cloud Composer health?

I recently built 120 DAGs using Cloud Composer. They all functioned for a while.
They were all approximately the same. Each used the PythonOperator. Each made API calls to Google Search Console. Each collected 7-9k rows of GSC data into a pandas dataframe, then uploaded this to GCS buckets and BigQuery (partitioned and clustered).
Occasionally they would all fail one day because the GSC auth token had been revoked, but that was no problem: create new credentials, upload, and continue. That situation lasted a couple of months. Now nothing runs.
From the start, the Cloud Composer health metric had occasional red spots, but now it is solid red every day.
I have found documentation about how to check the health, but not how to find out why the health is so poor or how to fix it.
Can anyone point me in the right direction?
The environment health metric depends on a Composer-managed DAG named airflow_monitoring, which is triggered periodically by the airflow-monitoring pod. If this DAG hasn't been deleted, you can check the airflow-monitoring logs to see if there are any problems related to reading the DAG's run statuses. You can also try troubleshooting the error in Cloud Logging using the filter:
resource.type="cloud_composer_environment"
severity=ERROR
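If you prefer to pull those errors programmatically rather than through the Logging console, a minimal sketch using the google-cloud-logging client library would look like this (the project ID below is a placeholder):
from google.cloud import logging

client = logging.Client(project="my-composer-project")  # placeholder project ID
log_filter = 'resource.type="cloud_composer_environment" severity=ERROR'
# Print the most recent error entries for the Composer environment.
for entry in client.list_entries(filter_=log_filter, order_by=logging.DESCENDING, max_results=20):
    print(entry.timestamp, entry.severity, entry.payload)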
The liveness check failure could be due to the following reasons:
1. Resource constraints (memory and CPU).
2. A known issue with the Composer version. Please check the Composer release notes for any known issues.
3. The core:default_timezone Airflow configuration override. If you have configured the core:default_timezone Airflow configuration, the Composer environment health will be shown as unhealthy. It is a known issue and the Composer product team is working on a resolution.
Refer to this documentation for information on Cloud Composer’s environment health metric.
I was lucky enough to talk to someone from Google yesterday who said that what I need to do is recreate my Cloud Composer environment because I have insufficient CPU. He suggested the flexible choice when recreating.

Private Endpoint in ACI not available for germanywestcentral?

I am facing an issue where containers that should only have private endpoints cannot be deployed:
"The requested resource is not available in the location 'germanywestcentral' at this moment. Please retry with a different resource request or in another location. Resource requested: '1' CPU '1.5' GB memory 'Linux' OS virtual network"
The same container works fine as soon as I select a public interface.
I didn't find anything about it in the documentation or on the internet, so maybe someone here has an idea?
Thanks in advance
Stefan
I tried with the quickstart image and a Docker Hub registry image as well, getting the same result.
This error indicates that due to heavy load in the region in which you are attempting to deploy, the resources specified for your container can't be allocated at that time. Use one or more of the following mitigation steps to help resolve your issue.
1. Deploy to a different Azure region.
2. Deploy at a later time.
3. You can also reach out to support for more detailed information.
Reference: https://learn.microsoft.com/en-us/answers/questions/394616/trying-to-re-deploy-a-container-instance-but-error.html

Airflow DAG cannot find connection ID

I am managing a Google Cloud Composer environment which runs Airflow for a data engineering team. I have recently been asked to troubleshoot one of the DAGs they run, which is failing with this error: [12:41:18,119] {credentials_utils.py:23} WARNING - [redacted-name] connection ID not available, falling back to Google default credentials
The job is basically a data pipeline which reads from various sources and stores data into GBQ. The odd part is that they have a nearly identical DAG running for a different project and it works perfectly.
I have recreated the .json credentials for the service account behind the connection as well as the connection itself in Airflow. I have sanitized the code to see whether there were any hidden spaces or similar.
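For what it's worth, this is the quick check I run inside the environment to confirm whether the connection ID is visible to Airflow at all (the connection ID below is a placeholder):
from airflow.hooks.base import BaseHook  # on Airflow 1.10.x: from airflow.hooks.base_hook import BaseHook

try:
    conn = BaseHook.get_connection("my_gcp_connection")  # placeholder connection ID
    # Print the connection type and which keys are set in its extras (e.g. keyfile path/JSON).
    print(conn.conn_type, list(conn.extra_dejson.keys()))
except Exception as exc:
    print("connection lookup failed:", exc)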
My knowledge of Airflow is limited and I have not been able to find any similar issue in my research. Has anyone encountered this before?
So the DE team came back to me saying it was actually a deployment issue: an internal module involved in service account authentication was being used inside another DAG running in the staging environment, which made it impossible to fetch credentials from the connection ID.

Google Cloud Composer (Apache Airflow) cannot access log files

I'm running a DAG in Google Cloud Composer (hosted Airflow) which runs fine in Airflow locally. All it does is print "Hello World". However, when I run it through Cloud Composer I receive the error:
*** Log file does not exist: /home/airflow/gcs/logs/matts_custom_dag/main_test/2020-04-20T23:46:53.652833+00:00/2.log
*** Fetching from: http://airflow-worker-d775d7cdd-tmzj9:8793/log/matts_custom_dag/main_test/2020-04-20T23:46:53.652833+00:00/2.log
*** Failed to fetch log file from worker. HTTPConnectionPool(host='airflow-worker-d775d7cdd-tmzj9', port=8793): Max retries exceeded with url: /log/matts_custom_dag/main_test/2020-04-20T23:46:53.652833+00:00/2.log (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f8825920160>: Failed to establish a new connection: [Errno -2] Name or service not known',))
I've also tried making the DAG add data into a database, and it actually succeeds 50% of the time. However, it always returns this error message (and no other print statements or logs). Any help on why this might be happening would be much appreciated.
We also faced the same issue, then raised a support ticket with GCP and got the following reply:
The message is related to the latency of syncing logs from Airflow workers to the webserver; it takes at least a few minutes (depending on the number of objects and their size).
The total log size does not seem large, but it is enough to noticeably slow down synchronization; hence, we recommend cleaning up/archiving the logs.
Basically, we recommend relying on Stackdriver logs instead, because of the latency caused by the design of this sync.
I hope this will help you solve the problem.
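If you do go the Stackdriver/Cloud Logging route, this is roughly how we pull a DAG's worker logs with the google-cloud-logging client. Note that the log ID and label key below are our assumption about how Composer exports worker logs, and the project ID and DAG name are placeholders:
from google.cloud import logging

client = logging.Client(project="my-composer-project")  # placeholder project ID
# Assumed filter: Composer worker logs under log ID "airflow-worker", labeled by DAG name.
log_filter = (
    'resource.type="cloud_composer_environment" '
    'log_id("airflow-worker") '
    'labels.workflow="matts_custom_dag"'
)
for entry in client.list_entries(filter_=log_filter, order_by=logging.DESCENDING, max_results=50):
    print(entry.timestamp, entry.payload)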
I have the same problem after upgrading Google Composer from 1.10.3 to 1.10.6.
I can see in my logs that Airflow is trying to get the logs from a bucket whose name ends with -tenant, while the bucket in my account ends with -bucket.
In the configuration, I can see something weird too.
## airflow.cfg
[core]
remote_base_log_folder = gs://us-east1-dada-airflow-xxxxx-bucket/logs
## also in the running configuration says
core remote_base_log_folder gs://us-east1-dada-airflow-xxxxx-tenant/logs env var
I wrote to google support and they said the team is working on a fix.
EDIT:
I've been accessing my logs with gsutil and replacing the bucket name suffix with -bucket
gsutil cat gs://us-east1-dada-airflow-xxxxx-bucket/logs/...../5.logs
I faced the same situation on multiple occasions.
As soon as a job finished, when I took a look at the log in the Airflow web UI, it would give me the same error. However, when I checked the same logs in the UI after a minute or two, I could see them properly.
As per the above answers, it's a sync issue between the webserver and the worker node.
In general, the issue described here should be more of a sporadic issue.
In certain situations, what could help is setting default-task-retries to a value that allows for retrying a task at least once.
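The same effect can also be set per DAG through default_args; here is a minimal sketch, assuming Airflow 1.10.x import paths and using an illustrative DAG/task name:
from datetime import timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # Airflow 1.10.x import path
from airflow.utils.dates import days_ago

default_args = {
    "retries": 1,                         # retry each task at least once
    "retry_delay": timedelta(minutes=2),  # wait a bit before retrying
}

with DAG(
    dag_id="matts_custom_dag",            # illustrative DAG name
    default_args=default_args,
    start_date=days_ago(1),
    schedule_interval=None,
) as dag:
    main_test = BashOperator(task_id="main_test", bash_command='echo "Hello World"')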
This issue is resolved at least since Airflow version: 1.10.10+composer.

MSDeploy issues (WMSVC 500 error)

Having some issues with MSDeploy on a Windows Server 2008 box: the internal service is throwing a 500 error without putting anything in the server's event logs.
I'm attempting to set up automated deployments using MSBuild/TeamCity/MSDeploy, and this is basically the current halting point. Has anyone come across this issue before?
Thanks, Ed
To find out why you are getting this error, you should enable logging.
First, enable Failed Request Tracing for the web management service. You can see how to do this by referring to the "Optional: Set Up Tracing" section of this article:
http://learn.iis.net/page.aspx/984/configure-web-deploy/
The "frebs" can be found in:
C:\inetpub\logs\wmsvc\TracingLogFiles\W3SVC1
Open each of the frXXXXXX.xml files with IE and it'll use the freb.xsl transform to generate a nice report.
Don't delete freb.xsl when you're done; it doesn't always get recreated.
Then turn on logging for the web management service:
http://technet.microsoft.com/en-us/library/ff729437(WS.10).aspx
You want to have the following registry entry configured:
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\IIS Extensions\MSDeploy\1]
"EnabledTraceLevel"=dword:00000002
"EnabledTraceSources"=dword:000001ff
You can fiddle with the tracing levels/sources to increase and decrease the verbosity of the logs.
As per the article, the management service logs are written to:
%WINDIR%\ServiceProfiles\LocalService\AppData\Local\Temp\WMSvc.log
