Amazon CodeDeploy. Deployment failed - aws-code-deploy

I have been configuring AWS CodeDeploy for a few days and my first deployment is failing. The error message I get reads "The overall deployment failed because too many individual instances failed deployment, too few healthy instances are available for deployment, or some instances in your deployment group are experiencing problems."
To get more detailed info I have installed the AWS CodeDeploy agent on the Windows instance, but it appears not to be working. All I can find in the code-deploy-agent-log.txt file are these repetitive lines:
2016-05-31 16:05:24 DEBUG [codedeploy-agent(4872)]: InstanceAgent::Plugins::CodeDeployPlugin::CommandPoller: Sleeping 90 seconds.
2016-05-31 16:06:55 DEBUG [codedeploy-agent(4872)]: InstanceAgent::Plugins::CodeDeployPlugin::CommandPoller: Calling PollHostCommand:
2016-05-31 16:06:55 INFO [codedeploy-agent(4872)]: Version file found in C:/ProgramData/Amazon/CodeDeploy/.version.
2016-05-31 16:06:55 ERROR [codedeploy-agent(4872)]: InstanceAgent::Plugins::CodeDeployPlugin::CommandPoller: Missing credentials - please check if this instance was started with an IAM instance profile
My question is: how can I get more information about the error message I am getting on the deployments? Which credentials am I missing (or specifying incorrectly) that cause the error message in the log file?
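For reference, one way to check whether the instance was actually started with an instance profile is to query the instance metadata endpoint from the instance itself (this assumes curl or an equivalent HTTP client is available; the role name in the second call is only a placeholder):
# Lists the name of the role attached via the instance profile; an empty or 404 response means none is attached
curl http://169.254.169.254/latest/meta-data/iam/security-credentials/
# Shows the temporary credentials the CodeDeploy agent would pick up for that role
curl http://169.254.169.254/latest/meta-data/iam/security-credentials/MyCodeDeployInstanceRole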

I think you are missing the service-role-arn when creating your deployment group. The service role ARN allows AWS CodeDeploy to act on your behalf when interacting with AWS services; it is the ARN of the CodeDeploy role that you may have created earlier.
In addition, please make sure that your deployment policy is set to CodeDeployDefault.OneAtATime. This is to avoid taking all instances down if you push an incorrect or failing build.
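As a sketch only (the application name, group name, tag filter and role ARN below are placeholders, not the asker's actual setup), creating the deployment group with the AWS CLI would look roughly like this:
aws deploy create-deployment-group \
  --application-name MyApp \
  --deployment-group-name MyApp-Windows \
  --service-role-arn arn:aws:iam::123456789012:role/CodeDeployServiceRole \
  --deployment-config-name CodeDeployDefault.OneAtATime \
  --ec2-tag-filters Key=Name,Value=MyAppServer,Type=KEY_AND_VALUE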

I tried Suken Shah's steps but they didn't solve it for me. What worked for me was (a rough CLI equivalent is sketched below):
1) Creating an IamInstanceProfile, say Webserver.
2) Adding AWSCodeDeployRole to the IamInstanceProfile Webserver.
3) Adding the following to AWSCodeDeployRole's trust relationship: "codedeploy.amazonaws.com", "ec2.amazonaws.com", "codedeploy.MY_REGION.amazonaws.com"
4) Rebooting the EC2 instance.
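A rough CLI equivalent of the steps above, assuming the role is named AWSCodeDeployRole and the instance profile Webserver as described (the instance ID is a placeholder, and the trust policy is only a sketch of what step 3 describes):
# 1) Create the instance profile and add the role to it
aws iam create-instance-profile --instance-profile-name Webserver
aws iam add-role-to-instance-profile --instance-profile-name Webserver --role-name AWSCodeDeployRole
# 2) Let EC2 and CodeDeploy assume the role via its trust relationship
aws iam update-assume-role-policy --role-name AWSCodeDeployRole --policy-document '{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": ["ec2.amazonaws.com", "codedeploy.amazonaws.com", "codedeploy.MY_REGION.amazonaws.com"] },
    "Action": "sts:AssumeRole"
  }]
}'
# 3) Attach the profile to the instance (or do it in the console), then reboot
aws ec2 associate-iam-instance-profile --instance-id i-0123456789abcdef0 --iam-instance-profile Name=Webserver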

Make sure the role you use for EC2 has the 'AWSCodeDeployRole' policy and that its trust relationship includes the 'ec2.amazonaws.com' service. If you need to change the role, restart the EC2 instance.
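If in doubt, the attached policies and the trust relationship can be inspected from the CLI (the role name is a placeholder):
aws iam list-attached-role-policies --role-name MyEC2CodeDeployRole
aws iam get-role --role-name MyEC2CodeDeployRole --query Role.AssumeRolePolicyDocument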

Related

Airflow dag cannot find connection-id

I am managing a Google Cloud Composer environment which runs Airflow for a data engineering team. I have recently been asked to troubleshoot one of the DAGs they run, which is failing with this error: [12:41:18,119] {credentials_utils.py:23} WARNING - [redacted-name] connection ID not available, falling back to Google default credentials
The job is basically a data pipeline which reads from various sources and stores data into GBQ. The odd part is that they have a nearly identical DAG running for a different project and it works perfectly.
I have recreated the .json credentials for the service account behind the connection, as well as the connection itself in Airflow. I have sanitized the code to check whether there were any hidden spaces or similar issues.
My knowledge of Airflow is limited and I have not been able to find any similar issue in my research; has anyone encountered this before?
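For anyone debugging something similar, one sanity check is to list the connections the Composer environment actually sees, to confirm the connection ID was registered (environment name and location are placeholders; on Airflow 1.10.x the flag is --list, on 2.x the subcommand is simply connections list):
gcloud composer environments run my-composer-env --location europe-west1 connections -- --list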
So the DE team came back to me saying it was actually a deployment issue: an internal module involved in service account authentication was being used inside another DAG running in the stage environment, which made it impossible to fetch credentials from the connection ID.

Google Cloud Composer (Apache Airflow) cannot access log files

I'm running a DAG in Google Cloud Composer (hosted Airflow) which runs fine in Airflow locally. All it does is print "Hello World". However, when I run it through Cloud Composer I receive the error:
*** Log file does not exist: /home/airflow/gcs/logs/matts_custom_dag/main_test/2020-04-20T23:46:53.652833+00:00/2.log
*** Fetching from: http://airflow-worker-d775d7cdd-tmzj9:8793/log/matts_custom_dag/main_test/2020-04-20T23:46:53.652833+00:00/2.log
*** Failed to fetch log file from worker. HTTPConnectionPool(host='airflow-worker-d775d7cdd-tmzj9', port=8793): Max retries exceeded with url: /log/matts_custom_dag/main_test/2020-04-20T23:46:53.652833+00:00/2.log (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f8825920160>: Failed to establish a new connection: [Errno -2] Name or service not known',))
I've also tried making the DAG add data into a database and it actually succeeds 50% of the time. However, it always returns this error message (and no other print statements or logs). Any help much appreciated on why this might be happening.
We also faced the same issue, then raised a support ticket with GCP and got the following reply.
The message is related to the latency of syncing logs from Airflow workers to WebServer, it takes at least some minutes (depending on the number of objects and their size)
The total log size seems not large but it’s enough to noticeably slow down synchronization, hence, we recommend cleanup/archive the logs
Basically we recommend relying on Stackdriver logs instead, because of latency due to the design of this sync
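In practice that means reading the worker logs straight from Stackdriver/Cloud Logging rather than waiting for the sync, for example (the environment name is a placeholder and the exact labels can differ by Composer version):
gcloud logging read 'resource.type="cloud_composer_environment" AND resource.labels.environment_name="my-composer-env" AND log_name:"airflow-worker"' --limit=50 --format='value(textPayload)'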
I hope this will help you solve the problem.
I have the same problem after upgrading Google Composer from 1.10.3 to 1.10.6.
I can see in my logs that Airflow is trying to fetch the logs from a bucket whose name ends with -tenant, while the bucket in my account ends with -bucket.
In the configuration, I can see something weird too.
## airflow.cfg
[core]
remote_base_log_folder = gs://us-east1-dada-airflow-xxxxx-bucket/logs
## also in the running configuration says
core remote_base_log_folder gs://us-east1-dada-airflow-xxxxx-tenant/logs env var
I wrote to google support and they said the team is working on a fix.
EDIT:
I've been accessing my logs with gsutil, replacing the bucket name suffix with -bucket:
gsutil cat gs://us-east1-dada-airflow-xxxxx-bucket/logs/...../5.logs
I faced the same situation on multiple occasions.
When I looked at the logs in the Airflow web UI right after a job finished, I would get the same error; when I checked the same logs in the UI a minute or two later, they showed up properly.
As per the above answers, it's a sync issue between the webserver and the worker node.
In general, the issue described here is sporadic rather than permanent.
In certain situations, what could help is setting default-task-retries to a value that allows a task to be retried at least once.
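On Composer this can be applied as an Airflow configuration override, for example (environment name and location are placeholders; the core option default_task_retries assumes a recent enough Airflow 1.10.x):
gcloud composer environments update my-composer-env --location us-east1 --update-airflow-configs=core-default_task_retries=1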
This issue is resolved at least since Airflow version: 1.10.10+composer.

Google Cloud Composer The server encountered a temporary error and could not complete your request

After running for a couple of days, the Google Cloud Composer web UI returns a 502 Server Error indefinitely:
Error: Server Error
The server encountered a temporary error and could not complete your request.
Please try again in 30 seconds.
The only way to fix it is to recreate the Composer environment. Though after running for a couple of days the new environment crashes with the same error.
Image version: composer-1.4.0-airflow-1.10.0
Python version: 3
Does anyone know what the root cause is?
I don't run Cloud Composer, but I suspect that there's a case where the webserver has exited from all of its web worker threads. This can sometimes happen when Airflow has an extended timeout reading from or writing to the database, either due to a held lock or to network connection issues. It is probably configured to restart if it fully exits, but there are some cases where the airflow webserver command will still hold on without exiting even though all web workers have exited.
Alternatively, the 502 may be about the identity provider implemented for GCP. If that's the case, you might find you need to sign out of your Google login and use the sign-in flow provided by Airflow (if it responds to a private browser session or a signed-out session).
I was facing the same 502 error and it turned out to be an issue with the DAG itself. As mentioned:
https://cloud.google.com/composer/docs/how-to/using/troubleshooting-dags
"The web server parses the DAG definition files, and a 502 gateway timeout can occur if there are errors in the DAG."
This was visible under Composer / Monitoring: the web server was affected by an issue with the DAG itself. We solved it by deleting the recently added DAGs, and after a couple of minutes the Airflow UI was back up.
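For reference, removing a suspect DAG file from the environment can also be done with gcloud (file name, environment and location are placeholders):
gcloud composer environments storage dags delete my_broken_dag.py --environment=my-composer-env --location=us-east1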

Google Cloud Platform : Stackdriver Agent installation and Condiguration error on GCP VM instance

I don't have hands-on experience with Stackdriver monitoring configuration for Google Cloud Platform VM instances. The basic monitoring for our project works fine, but while trying to install the Stackdriver agent on Ubuntu 14.04 it gives us an error and Stackdriver with the agent does not work for us. Below is the error for your reference.
Jan 3 10:43:42 ubuntu-uat01 collectd[2283]: write_gcm: Unsuccessful HTTP request 403: {
  "error": {
    "code": 403,
    "message": "User is not authorized to access the project monitoring records.",
    "status": "PERMISSION_DENIED"
  }
}
Jan 3 10:43:42 ubuntu-uat01 collectd[2283]: write_gcm: Error -2 from wg_curl_get_or_post
Jan 3 10:43:42 ubuntu-uat01 collectd[2283]: write_gcm: wg_transmit_unique_segment failed.
Can someone help me set up Stackdriver monitoring with the agent installed on the server, or point me to some documentation if any is available?
I got this precise error on my instances until I added the permission 'Monitoring Metric Writer' to the service account.
You could also, as Igor suggested, add the monitoring API scope to the instance.
See the Stackdriver Monitoring docs.
Most likely you either don't have the Stackdriver Monitoring API enabled in your project, or your VM does not have the correct scopes. There are extensive instructions on the Google Cloud site for installing the agent, including the troubleshooting page.
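A hedged sketch of both fixes (project, instance and zone are placeholders; changing scopes requires the instance to be stopped first):
# Enable the Monitoring API for the project
gcloud services enable monitoring.googleapis.com --project=my-project
# Grant the instance the monitoring (and logging) write scopes; stop the instance before changing scopes
gcloud compute instances set-service-account my-instance --zone=us-central1-a --scopes=https://www.googleapis.com/auth/monitoring.write,https://www.googleapis.com/auth/logging.write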
If you are installing the Stackdriver monitoring and logging agents on your instance, you need to make sure the service account attached to your instance has the proper rights to write data to Stackdriver. Simply run the following commands to assign the proper roles:
gcloud projects add-iam-policy-binding PROJECT_NAME --member="serviceAccount:SERVICE_ACCOUNT_EMAIL" --role="roles/logging.logWriter"
gcloud projects add-iam-policy-binding PROJECT_NAME --member="serviceAccount:SERVICE_ACCOUNT_EMAIL" --role="roles/monitoring.metricWriter"
Replace PROJECT_NAME and SERVICE_ACCOUNT_EMAIL with the proper values from your environment.
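After updating the bindings, restarting the agent on the instance should make it pick up the new permissions (assuming the collectd-based agent is installed as the stackdriver-agent service, as the log above suggests):
sudo service stackdriver-agent restart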

MSDeploy issues (WMSVC 500 error)

Having some issues with MSDeploy on a Windows Server 2008 box: the internal service is throwing a 500 error without putting anything in the server's event logs.
I'm attempting to set up automated deployments using MSBuild/TeamCity/MSDeploy, and this is basically the current halting point. Has anyone come across this issue before?
Thanks, Ed
To find out why you are getting this error you should enable logging.
First, enable Failed Request Tracing for the web management service. You can see how to do this by referring to the "Optional: Set Up Tracing" section of this article:
http://learn.iis.net/page.aspx/984/configure-web-deploy/
The "frebs" can be found in:
C:\inetpub\logs\wmsvc\TracingLogFiles\W3SVC1
Open each of the frXXXXXX.xml files with IE and it'll use the freb.xsl transform to generate a nice report.
Don't delete freb.xsl when you're done, it doesn't always get recreated.
Then turn on logging for the web management service:
http://technet.microsoft.com/en-us/library/ff729437(WS.10).aspx
You want to have the following registry entry configured:
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\IIS Extensions\MSDeploy\1]
"EnabledTraceLevel"=dword:00000002
"EnabledTraceSources"=dword:000001ff
You can fiddle with the tracing levels/sources to increase and decrease the verbosity of the logs.
As per the article the management service logs are written to:
%WINDIR%\ServiceProfiles\LocalService\AppData\Local\Temp\WMSvc.log
