Airflow EcsOperator not fetching CloudWatch Logs

I'm using Airflow's EcsOperator, with the ECS tasks writing their logs to CloudWatch.
Sometimes the Airflow log fetcher collects the logs from CloudWatch and sometimes it does not.
On the CloudWatch console, I always see the logs.
On tasks that take a long time, I usually see the log, or at least part of it.
Has anyone had the same issue with EcsOperator?

First, ECSOperator is deprecated and has been removed in provider version 5.0.0.
You should switch to EcsRunTaskOperator.
EcsRunTaskOperator has an awslogs_fetch_interval parameter which controls the interval at which logs are fetched from ECS. The default is 30 seconds. If you wish for more frequent polls, set the parameter value accordingly.
You didn't mention which provider version you are on, but this part of the code was refactored in version 5.0.0 (PR), so upgrading the Amazon provider might also resolve your issue.
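For reference, here is a minimal sketch of how the operator can be configured with a faster fetch interval; the cluster, task definition, log group, stream prefix, and region values are placeholders, not taken from the question:

from datetime import timedelta
from airflow.providers.amazon.aws.operators.ecs import EcsRunTaskOperator

run_task = EcsRunTaskOperator(
    task_id="run_my_ecs_task",
    cluster="my-cluster",                      # placeholder cluster name
    task_definition="my-task-def",             # placeholder task definition
    launch_type="FARGATE",
    overrides={"containerOverrides": []},
    # Point the log fetcher at the same CloudWatch group the task writes to
    awslogs_group="/ecs/my-task-def",          # placeholder log group
    awslogs_stream_prefix="ecs/my-container",  # placeholder stream prefix
    awslogs_region="eu-west-1",                # placeholder region
    # Poll CloudWatch more often than the 30-second default
    awslogs_fetch_interval=timedelta(seconds=10),
)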

Related

GCP's PubSub yields "no available instance" error with scheduled tasks

I would like to set up a scheduled task to run every hour and, say, send emails. In a Firebase project, I've set up a function with .schedule('every 60 minutes') and can see it successfully in the GCP portal as well. Recently, this error started to show up in the logs, tied to this scheduled function: "The request was aborted because there was no available instance. Additional troubleshooting documentation can be found at: https://cloud.google.com/functions/docs/troubleshooting#scalability"
There are plenty of conversations online about "no available instance", mostly tied to traffic spikes and the like. But for a scheduled task, what can I do to avoid this? My main goal is to not have to worry whether I have 3 tasks to run the next hour or 300. I want it to auto-scale and do its magic. Am I missing something here?
Thanks

Datastore to Firestore (Datastore mode) automatic migration: request timeout while accessing Datastore after the REDIRECT_STRONGLY_CONSISTENT_READS step

I am using Objectify (v5) for accessing Datastore, in the App Engine Standard Environment, with Java. Entities are cached by Objectify automatically, and I am also using Memcache separately.
This issue of Datastore APIs timing out started happening exactly after the REDIRECT_STRONGLY_CONSISTENT_READS migration step. Strangely, it happens roughly every hour or two, lasts for 3-4 minutes, and then goes back to normal. Since request latency goes up from ~200ms to more than 60s, lots of new instances get created and I am getting charged heavily.
Here are some of the errors:
com.google.api.server.spi.SystemService invokeServiceMethod: exception occurred while calling backend method
java.util.concurrent.CancellationException: Task was cancelled.
at com.google.common.util.concurrent.AbstractFuture.cancellationExceptionWithCause(AbstractFuture.java:1550)
at com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:590)
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:467)
at com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:122)
at com.google.appengine.tools.development.TimedFuture.get(TimedFuture.java:55)
at com.google.common.util.concurrent.ForwardingFuture.get(ForwardingFuture.java:68)
at com.google.appengine.api.utils.FutureWrapper.get(FutureWrapper.java:89)
at com.google.appengine.api.datastore.Batcher$ReorderingMultiFuture.get(Batcher.java:114)
at com.google.appengine.api.utils.FutureWrapper.get(FutureWrapper.java:89)
at com.googlecode.objectify.cache.TriggerFuture.get(TriggerFuture.java:100)
at com.googlecode.objectify.impl.ResultAdapter.now(ResultAdapter.java:34)
at com.googlecode.objectify.util.ResultWrapper.translate(ResultWrapper.java:22)
at com.googlecode.objectify.util.ResultWrapper.translate(ResultWrapper.java:10)
at com.googlecode.objectify.util.ResultTranslator.nowUncached(ResultTranslator.java:21)
at com.googlecode.objectify.util.ResultCache.now(ResultCache.java:30)
at com.googlecode.objectify.util.ResultWrapper.translate(ResultWrapper.java:22)
at com.googlecode.objectify.util.ResultWrapper.translate(ResultWrapper.java:10)
at com.googlecode.objectify.util.ResultTranslator.nowUncached(ResultTranslator.java:21)
at com.googlecode.objectify.util.ResultCache.now(ResultCache.java:30)
and
com.googlecode.objectify.cache.EntityMemcache getAll: Error obtaining cache for [<dummy-entity-name>]
java.util.concurrent.CancellationException: Task was cancelled.
and
java.lang.InterruptedException
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:460)
I have tried upgrading to the latest versions:
implementation 'com.google.appengine:appengine-api-1.0-sdk:2.0.5'
implementation 'javax.servlet:javax.servlet-api:3.1.0'
implementation 'com.googlecode.objectify:objectify:5.1.25'
and I have also flushed Memcache. Nothing has worked.
Has anyone faced this issue?
It turned out to be an issue with the migration itself. I reached out to Google Cloud support (paid), and the Datastore/Firestore product team paused the migration, which I could not do myself, by the way; it wasn't allowed for my project. They did some fixing and completed the migration. Now Datastore queries are running as expected.

How to improve Cloud Composer health?

I recently built 120 DAGs using Cloud Composer. They all functioned for a while.
They were all approximately the same. Each used the PythonOperator. Each made API calls to Google Search Console. Each collected 7-9k rows of GSC data into a pandas dataframe, then uploaded this to GCS buckets and BigQuery (partitioned and clustered).
Occasionally I'd have them all fail one day because the GSC auth token had been revoked, but no problem: create new credentials, upload, and continue. That situation lasted a couple of months. Now nothing runs.
From the start, the Cloud Composer health had occasional red spots, but now the health is static red every day.
I have found documentation about how to check the health, but not how to find out why the health is so poor and fix it.
Can anyone point me in the right direction?
The environment health metric depends on a Composer-managed DAG named airflow_monitoring, which is triggered periodically by the airflow-monitoring pod. If this DAG hasn't been deleted, you can check the airflow-monitoring logs to see if there are any problems related to reading the DAG's run statuses. You can also try troubleshooting the error in Cloud Logging using the filter:
resource.type="cloud_composer_environment"
severity=ERROR
The liveness check failure could be due to the following reasons:
- A resource constraint (memory or CPU).
- A known issue with the Composer version. Please check the Composer release notes for any known issues.
- The core:default_timezone Airflow configuration (if you have configured the core:default_timezone Airflow configuration, the Composer environment health will be shown as unhealthy; it is a known issue and the Composer product team is working on a resolution).
Refer to this documentation for information on Cloud Composer’s environment health metric.
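If you would rather pull those error entries programmatically than through the Logs Explorer, a minimal sketch with the google-cloud-logging Python client could look like this; the project ID is a placeholder:

from google.cloud import logging

client = logging.Client(project="my-gcp-project")  # placeholder project ID
filter_str = 'resource.type="cloud_composer_environment" severity=ERROR'

# Print the most recent error entries for the Composer environment
for entry in client.list_entries(filter_=filter_str, order_by=logging.DESCENDING, max_results=20):
    print(entry.timestamp, entry.payload)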
I was lucky enough to talk to someone from Google yesterday, who said that what I need to do is recreate my Cloud Composer environment because I have insufficient CPU. He suggested the flexible choice when recreating.

Airflow dag cannot find connection-id

I am managing a Google Cloud Composer environment which runs Airflow for a data engineering team. I have recently been asked to troubleshoot one of the DAGs they run, which is failing with this error:
[12:41:18,119] {credentials_utils.py:23} WARNING - [redacted-name] connection ID not available, falling back to Google default credentials
The job is basically a data pipeline which reads from various sources and stores data into GBQ. The odd part is that they have a strictly similar DAG running for a different project, and it works perfectly.
I have recreated the .json credentials for the service account behind the connection, as well as the connection itself in Airflow. I have also sanitized the code to check for any hidden spaces or the like.
My knowledge of Airflow is limited and I have not been able to find any similar issue in my research. Has anyone encountered this before?
So the DE team came back to me saying it was actually a deployment issue: an internal module involved in service account authentication was being used inside another DAG running in the stage environment, which made it impossible to fetch the credentials from the connection ID.
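For anyone hitting the same warning, one quick way to confirm whether the workers can actually resolve the connection ID is to look it up with BaseHook from a throwaway task. This is only a diagnostic sketch; the connection ID below is a placeholder, not the redacted one from the question:

from airflow.hooks.base import BaseHook
from airflow.exceptions import AirflowNotFoundException

def check_connection():
    try:
        # Raises AirflowNotFoundException if the connection is not visible to this worker
        conn = BaseHook.get_connection("my_gcp_connection")  # placeholder connection ID
        print("Found connection:", conn.conn_type, conn.host)
    except AirflowNotFoundException:
        print("Connection ID not found in this environment")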

Google Cloud Composer (Apache Airflow) cannot access log files

I'm running a DAG in Google Cloud Composer (hosted Airflow) which runs fine in Airflow locally. All it does is print "Hello World". However, when I run it through Cloud Composer I receive the error:
*** Log file does not exist: /home/airflow/gcs/logs/matts_custom_dag/main_test/2020-04-20T23:46:53.652833+00:00/2.log
*** Fetching from: http://airflow-worker-d775d7cdd-tmzj9:8793/log/matts_custom_dag/main_test/2020-04-20T23:46:53.652833+00:00/2.log
*** Failed to fetch log file from worker. HTTPConnectionPool(host='airflow-worker-d775d7cdd-tmzj9', port=8793): Max retries exceeded with url: /log/matts_custom_dag/main_test/2020-04-20T23:46:53.652833+00:00/2.log (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f8825920160>: Failed to establish a new connection: [Errno -2] Name or service not known',))
I've also tried making the DAG add data into a database, and it actually succeeds 50% of the time. However, it always returns this error message (and no other print statements or logs). Any help on why this might be happening would be much appreciated.
We also faced the same issue, then raised a support ticket with GCP and got the following reply:
The message is related to the latency of syncing logs from the Airflow workers to the web server; it takes at least a few minutes (depending on the number of objects and their size).
The total log size does not seem large, but it's enough to noticeably slow down synchronization, hence we recommend cleaning up/archiving the logs.
Basically, we recommend relying on Stackdriver logs instead, because of the latency due to the design of this sync.
I hope this helps you solve the problem.
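If you do go the Stackdriver/Cloud Logging route, a filter along these lines is one way to pull the task output directly; the airflow-worker log name and the workflow label are assumptions that may differ between Composer versions, and the DAG name is simply the one from the question:

resource.type="cloud_composer_environment"
logName:"airflow-worker"
labels.workflow="matts_custom_dag"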
I had the same problem after upgrading Google Cloud Composer from 1.10.3 to 1.10.6.
I can see in my logs that Airflow is trying to get the logs from a bucket whose name ends with -tenant, while the bucket in my account ends with -bucket.
In the configuration, I can see something weird too:
## airflow.cfg
[core]
remote_base_log_folder = gs://us-east1-dada-airflow-xxxxx-bucket/logs
## also in the running configuration says
core remote_base_log_folder gs://us-east1-dada-airflow-xxxxx-tenant/logs env var
I wrote to Google support and they said the team is working on a fix.
EDIT:
I've been accessing my logs with gsutil, replacing the bucket name suffix with -bucket:
gsutil cat gs://us-east1-dada-airflow-xxxxx-bucket/logs/...../5.logs
I faced the same situation on multiple occasions.
As soon as a job finished, when I took a look at the log in the Airflow web UI it would give me the same error, although when I checked the same logs in the UI a minute or two later, I could see them properly.
As per the above answers, it's a sync issue between the web server and the worker node.
In general, the issue described here should be more of a sporadic issue.
In certain situations, what could help is setting default_task_retries to a value that allows a task to be retried at least once (see the sketch below).
This issue is resolved at least since Airflow version 1.10.10+composer.
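Here is a minimal sketch of retries configured per DAG through default_args, which has the same effect for that DAG's tasks as the default_task_retries config mentioned above. It assumes a recent Airflow 2.x environment, and the DAG/task IDs reuse the names from the question purely as an example:

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

default_args = {
    "retries": 1,                          # retry each task at least once
    "retry_delay": timedelta(minutes=2),   # wait a bit before retrying
}

with DAG(
    dag_id="matts_custom_dag",             # example name taken from the question
    start_date=datetime(2020, 4, 20),
    schedule_interval=None,
    default_args=default_args,
) as dag:
    PythonOperator(task_id="main_test", python_callable=lambda: print("Hello World"))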
