How to improve Cloud Composer health? - airflow

I recently built 120 DAGs using Cloud Composer. They all functioned for a while.
They were all approximately the same: each used the PythonOperator, each made API calls to Google Search Console (GSC), and each collected 7-9k rows of GSC data into a pandas DataFrame, then uploaded it to GCS buckets and BigQuery (partitioned and clustered).
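Roughly, each DAG followed the shape sketched below. This is a minimal reconstruction rather than the actual code: fetch_gsc_rows is a hypothetical placeholder for the Search Console API call, writing to a gs:// path via to_csv assumes gcsfs is installed, and to_gbq assumes pandas-gbq.
from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator

def fetch_gsc_rows(site: str, date: str) -> list:
    """Hypothetical helper: call the Search Console API and return ~7-9k row dicts."""
    raise NotImplementedError("replace with the real GSC query")

def extract_and_load(**context):
    # Pull one day of GSC data into a DataFrame
    df = pd.DataFrame(fetch_gsc_rows(site="https://example.com", date=context["ds"]))
    # Stage the file in GCS, then append to a partitioned/clustered BigQuery table
    df.to_csv(f"gs://my-bucket/gsc/{context['ds']}.csv", index=False)
    df.to_gbq("my_dataset.gsc_data", if_exists="append")

with DAG(
    dag_id="gsc_example_site",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="extract_and_load", python_callable=extract_and_load)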
Occasionally they would all fail for a day because the GSC auth token had been revoked, but that was no problem: create new credentials, upload them, and continue. That situation lasted a couple of months. Now nothing runs.
From the start, the Cloud Composer health metric showed occasional red spots, but now it is solid red every day.
I have found documentation about how to check the health, but not about how to find out why the health is so poor and how to fix it.
Can anyone point me in the right direction?

The environment health metric depends on a Composer-managed DAG named airflow_monitoring, which is triggered periodically by the airflow-monitoring pod. Provided this DAG hasn't been deleted, you can check the airflow-monitoring logs to see whether there are any problems related to reading the DAG's run statuses. You can also try troubleshooting the error in Cloud Logging using the filter:
resource.type="cloud_composer_environment"
severity=ERROR
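For example, the same filter can be applied from the command line with gcloud (project ID and time window below are placeholders):
gcloud logging read \
  'resource.type="cloud_composer_environment" severity=ERROR' \
  --project=my-project --freshness=1d --limit=50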
The liveness check failure could be due to the following reasons:
Resource constraints (memory and CPU).
A known issue with the Composer version. Please check the Composer release notes for any known issues.
The core:default_timezone Airflow configuration override. If you have configured the core: default_timezone Airflow configuration, the Composer environment health will be shown as unhealthy. This is a known issue and the Composer product team is working on a resolution. (A command to remove the override is sketched below.)
Refer to this documentation for information on Cloud Composer’s environment health metric.
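If the core:default_timezone override turns out to be the culprit, one possible workaround is to drop the override with gcloud. This is a sketch; the environment name and location are placeholders, so verify the flag against the current gcloud reference:
gcloud composer environments update my-environment \
  --location us-central1 \
  --remove-airflow-configs core-default_timezone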

I was lucky enough to talk to someone from Google yesterday, who said that what I need to do is recreate my Cloud Composer environment because I have insufficient CPU. He suggested the flexible choice when recreating.
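For reference, recreating on Composer 2 lets you size CPU and memory per component and autoscale workers. A hedged sketch only; the environment name, region, and sizes are placeholders, and the flags should be checked against the current gcloud reference:
gcloud composer environments create my-new-environment \
  --location us-central1 \
  --image-version composer-2-airflow-2 \
  --environment-size medium \
  --scheduler-cpu 2 --scheduler-memory 4 \
  --worker-cpu 2 --worker-memory 8 \
  --min-workers 2 --max-workers 6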

Related

GCP's PubSub yields "no available instance" error with scheduled tasks

I would like to set up a scheduled task to run every hour and, say, send emails. In a Firebase project, I've set up a function with .schedule('every 60 minutes') and can see it successfully in the GCP portal as well. Recently, this error started to show up in the logs for this scheduled function: "The request was aborted because there was no available instance. Additional troubleshooting documentation can be found at: https://cloud.google.com/functions/docs/troubleshooting#scalability"
I saw plenty of conversations online about "no available instance", mainly tied to traffic spikes and the like. But for a scheduled task, what can I do to avoid this? My main goal is to not have to worry whether I have 3 tasks to run the next hour or 300. I want it to auto-scale and do the magic that it does. Am I missing something here?
Thanks

Airflow ECS-Operator not fetching CloudWatch Logs

I'm using Airflow's EcsOperator, with ECS tasks writing to CloudWatch.
Sometimes the Airflow log fetcher collects the logs from CloudWatch and sometimes it does not.
On the CloudWatch console, I always see the logs.
On tasks that take a long time, I usually see the log, or at least part of it.
Has anyone had the same issue with ECSOperator?
First, ECSOperator is deprecated and was removed in provider version 5.0.0; you should switch to EcsRunTaskOperator.
EcsRunTaskOperator has an awslogs_fetch_interval parameter that controls the interval at which logs are fetched from ECS. The default is 30 seconds; if you want more frequent polls, set the parameter accordingly (see the sketch below).
You didn't mention which provider version you are on, but this part of the code was refactored in version 5.0.0 (PR), so upgrading the Amazon provider might also resolve your issue.
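A minimal sketch of what that looks like; the cluster, task definition, and log group names are placeholders, and it assumes apache-airflow-providers-amazon >= 5.0.0 (a Fargate task would normally also need a network configuration):
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.amazon.aws.operators.ecs import EcsRunTaskOperator

with DAG(dag_id="ecs_logs_example", start_date=datetime(2023, 1, 1), schedule_interval=None) as dag:
    run_task = EcsRunTaskOperator(
        task_id="run_my_ecs_task",
        cluster="my-cluster",
        task_definition="my-task-def",
        launch_type="FARGATE",
        overrides={"containerOverrides": []},
        awslogs_group="/ecs/my-task",
        awslogs_stream_prefix="ecs/my-container",
        awslogs_region="us-east-1",
        # Poll CloudWatch every 5 seconds instead of the 30-second default
        awslogs_fetch_interval=timedelta(seconds=5),
    )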

Airflow dag cannot find connection-id

I am managing a Google Cloud Composer environment which runs Airflow for a data engineering team. I have recently been asked to troubleshoot one of the DAGs they run, which is failing with this error: [12:41:18,119] {credentials_utils.py:23} WARNING - [redacted-name] connection ID not available, falling back to Google default credentials
The job is basically a data pipeline which reads from various sources and stores the data in GBQ. The odd part is that they have a strictly similar DAG running for a different project and it works perfectly.
I have recreated the .json credentials for the service account behind the connection, as well as the connection itself in Airflow. I have sanitized the code to see if there were any hidden spaces or the like.
My knowledge of Airflow is limited and I have not been able to find any similar issue in my research. Has anyone encountered this before?
So the DE team came back to me saying it was actually a deployment issue: an internal module involved in service account authentication was being used inside another DAG running in the stage environment, making it impossible to fetch credentials from the connection ID.
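For anyone debugging the same warning, a quick way to confirm that the connection ID resolves inside the environment is a throwaway task like this (the connection ID is a placeholder; the import path is for Airflow 2):
from airflow.hooks.base import BaseHook

def check_connection():
    # get_connection raises AirflowNotFoundException if the ID is not defined
    conn = BaseHook.get_connection("my_gcp_connection")
    print(conn.conn_id, conn.conn_type)
Run it from a PythonOperator in the affected environment; if it raises, the connection is missing there rather than misconfigured in the DAG code.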

Google Cloud Composer (Apache Airflow) cannot access log files

I'm running a DAG in Google Cloud Composer (hosted Airflow) which runs fine in Airflow locally. All it does is print "Hello World". However, when I run it through Cloud Composer I receive the error:
*** Log file does not exist: /home/airflow/gcs/logs/matts_custom_dag/main_test/2020-04-20T23:46:53.652833+00:00/2.log
*** Fetching from: http://airflow-worker-d775d7cdd-tmzj9:8793/log/matts_custom_dag/main_test/2020-04-20T23:46:53.652833+00:00/2.log
*** Failed to fetch log file from worker. HTTPConnectionPool(host='airflow-worker-d775d7cdd-tmzj9', port=8793): Max retries exceeded with url: /log/matts_custom_dag/main_test/2020-04-20T23:46:53.652833+00:00/2.log (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f8825920160>: Failed to establish a new connection: [Errno -2] Name or service not known',))
I've also tried making the DAG add data into a database and it actually succeeds 50% of the time. However, it always returns this error message (and no other print statements or logs). Any help much appreciated on why this might be happening.
We also faced the same issue, then raised a support ticket with GCP and got the following reply:
The message is related to the latency of syncing logs from Airflow workers to the web server; it takes at least a few minutes (depending on the number of objects and their size).
The total log size does not seem large, but it is enough to noticeably slow down synchronization, hence we recommend cleaning up/archiving the logs.
Basically, we recommend relying on Stackdriver logs instead, because of the latency due to the design of this sync.
I hope this helps you solve the problem.
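For example, in the Logs Explorer a filter along these lines surfaces the worker logs for a single DAG. Treat the label and log names as assumptions: labels.workflow is the label I have seen Composer attach to Airflow worker entries, and the exact names can vary between Composer versions.
resource.type="cloud_composer_environment"
log_id("airflow-worker")
labels.workflow="matts_custom_dag"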
I had the same problem after upgrading Google Cloud Composer from Airflow 1.10.3 to 1.10.6.
I can see in my logs that Airflow is trying to fetch the logs from a bucket whose name ends with -tenant, while the bucket in my account ends with -bucket.
In the configuration, I can see something weird too.
## airflow.cfg
[core]
remote_base_log_folder = gs://us-east1-dada-airflow-xxxxx-bucket/logs
## also in the running configuration says
core remote_base_log_folder gs://us-east1-dada-airflow-xxxxx-tenant/logs env var
I wrote to Google support and they said the team is working on a fix.
EDIT:
I've been accessing my logs with gsutil, replacing the bucket name suffix with -bucket:
gsutil cat gs://us-east1-dada-airflow-xxxxx-bucket/logs/...../5.logs
I faced the same situation on multiple occasions.
As soon as a job finished and I took a look at the log in the Airflow web UI, it used to give me the same error, although when I checked the same logs in the UI a minute or two later, I could see them properly.
As per the above answers, it is a sync issue between the webserver and the worker node.
In general, the issue described here should be a sporadic one.
In certain situations, what can help is setting default_task_retries to a value that allows a task to be retried at least once (an example of setting it on Composer follows below).
This issue is resolved at least since Airflow version 1.10.10+composer.
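On Cloud Composer, that Airflow setting can be applied as an environment-level configuration override, for example with gcloud (environment name, location, and retry count are placeholders; verify the flag against your gcloud version):
gcloud composer environments update my-environment \
  --location us-central1 \
  --update-airflow-configs core-default_task_retries=2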

Firebase Remote Config A/B testing shows no results after 24 hours

I configured Firebase Remote Config A/B testing for Android, and we rolled it out on at least 10K devices.
For some reason, I see "0 users" in my A/B test after more than 24 hours.
The Firebase GMS version is 11.8.0.
Should it show A/B participants in real time, or is it OK to see 0 users after 24 hours?
P.S.: We are able to get A/B test variants on test devices through the Firebase Instance ID; that works well.
The simplest experiment that is running has only the app package as a target, with no additional filters, and it shows 0 users as well.
Finally, we found an answer!
Maybe somebody will find it helpful:
For now, this happens (no data in a Firebase Remote Config A/B test experiment) if you have an activation event configured for the experiment.
If you have 2 different experiments, both will fail to get results even if only 1 of them has an "activation event" configured.
Additionally, Remote Config will not work either; you will only get the default values.
We have already reported this to Google, so I hope they will fix it at some point.
Other useful info which is really hard to find:
How long is it OK to see "0 Total Users" in an experiment I've just started?
It takes many hours before you can see any data in your experiment. We were able to see results only 21 hours after the experiment started, so if you configured everything well, don't worry and wait for at least 24 hours. It will show 0 "Total Users" for many hours after the start.
Should I use the app versionName or versionCode in the "Version" field of the experiment setup?
You should use versionName.
Some useful info from support:
Firebase SDK
Make sure your users have the version of your app with the latest SDK.
Since your experiment is with Remote Config
When activateFetched() is called, all events from that point on will be tagged with the experiment. If a goal or activation event happens before activateFetched(), such as automatic events like first_open, session_start, etc., the experiment setup might be wrong.
Are you using an activation event?
Make sure to call fetch() and activateFetched() before the activation event occurs.
Experiment ID of the experiments (in case support asks you about it)
It is the number at the end of the URL when viewing the experiment results.
This debugging log could be useful for working out what is going on.
Also:
A good way to check whether your experiment is working is to target a specific version you have not published yet and check the logs from Remote Config with a fresh app install (or after erasing all app data and restarting).
It should show a different variant every time you reinstall the app, since your Firebase Instance ID changes after an app reinstall or app data erase.
If you see the variants change, then the A/B test is running well.
In your "build.graddle": don't forget to set the same versionName which you set in experiment setup.
In my case, I was receiving A/B testing results but they suddenly stopped appearing. This went on for 7 days and then the results appeared. A Firebase support manager said:
what I suspected here is just a delay in showing the result in the experiments
Additionally, she said:
With that, I would suggest always using the latest SDK version and enabling Google Analytics data sharing.
In my case, I wasn't using the latest SDK version, and Google Analytics data sharing was enabled for "Benchmarking", "Technical Support", and "Account Specialists" but not for "Google products & services". I believe these settings were enabled by default (screenshot from Google Analytics).
