GCP's PubSub yields "no available instance" error with scheduled tasks - firebase

I would like to setup a scheduled task to run every hour and say send emails. In a Firebase project, I've setup a function with .schedule('every 60 minutes') and can see it successfully in the GCP portal as well. Recently, this error started to show up in the logs: "The request was aborted because there was no available instance. Additional troubleshooting documentation can be found at: https://cloud.google.com/functions/docs/troubleshooting#scalability" tied to this scheduled function.
There are plenty of conversations online that I saw tied to "no available instance" and mainly due to traffic spikes etc.. But for a scheduled task, what can I do to avoid this? My main goal is to not have to worry if I have 3 tasks to run the next hour or 300. I want to auto-scale and do the magic that it does. Am I missing something here?
Thanks

Related

Airflow ECS-Operator not fetching CloudWatch Logs

I'm using Airflow's EcsOperator, ECS tasks writing to Cloudwatch.
Sometimes Airflow log fetcher collects logs from CloudWatch and sometimes does not.
On the CloudWatch console, I always see the logs.
On tasks that take a long time, I usually see the log or at least part of it.
Someone had the same issue with ECSOperator?
First ECSOperator is deprecated and removed in provider version 5.0.0
You should switch to EcsRunTaskOperator.
In EcsRunTaskOperator there is awslogs_fetch_interval which control over the interval to fetch logs from Ecs. The default is 30 seconds. If you wish for more frequent polls then set the parameter value accordingly.
You didn't mention what provider version you are on but this part of the code was refactored in version 5.0.0 (PR) so upgrading the Amazon provider might also resolve your issue.

How to improve Cloud composer health?

I recently built 120 dags using cloud composer. They all functioned for a while.
They were all approximately the same. Each used python operator. Each made API calls to google search console. Each collected 7-9k rows of GSC data into a pandas dataframe, then uploaded this to GCS buckets and BigQuery (partitioned and clustered).
Occasionally I'd have all fail one day because the GSC auth token had been revoked, but no problem, create new credentials, upload and continue. That situation lasted a couple of months. Now nothing runs.
From the start, the cloud composer health had occasional red spots, but now the health is static red every day.
I have found documentation about how to check the health, but not how to find why the health is so poor and fix it.
Can anyone point me in the right direction?
The environment health metric depends on a Composer-managed DAG named airflow_monitoring which is triggered periodically by the airflow-monitoring pod. If this DAG isn't deleted, you can check the airflow-monitoring logs to see if there are any problems related to reading the DAG's run statuses. Consequently, you can also try troubleshooting the error in Cloud Logging using the filter:
resource.type="cloud_composer_environment"
severity=ERROR
The liveness check failure could be due to the following reasons:
Any resource constraint(Memory and CPU)
Known issue with the composer version. Please check composer
release
notes for any
known issues.
Airflow configuration as core:default_timezone(If you’ve
configured core: default_timezone airflow configuration composer
environment health will be shown as unhealthy. It is a known
issue and the composer product team is working on the resolution.)
Refer to this documentation for information on Cloud Composer’s environment health metric.
I was lucky enough to talk to someone from Google yesterday who said what I need to do is recreate my cloud composer environment because I have insufficient CPU. He suggested the flexible choice when recreating.

Airflow dag cannot find connection-id

I am managing a Google Cloud Composer environment which runs Airflow for a data engineering team. I have recently been asked to troubleshoot one of the dags they run which is failing with this error : [12:41:18,119] {credentials_utils.py:23} WARNING - [redacted-name] connection ID not available, falling back to Google default credentials
The job is basically a data pipeline which reads from various sources and stores data into GBQ. The odd part is that they have a strictly similar Dag running for a different project and it works perfectly.
I have recreated the .json credentials for the service account behind the connection as well as the connection itself in Airflow. I have sanitized the code to see if there was any hidden spaces or so.
My knowledge of Airflow is limited and I have not been able to find any similar issue in my research, any one have encountered this before?
So the DE team came back to me saying it was actually a deployment issue where an internal module involved in service account authentication was being utilized inside another DAG running in stage environment, rendering it impossible to proceed to credential fetch from the connection ID.

Google Cloud Composer (Apache Airflow) cannot access log files

I'm running a DAG in Google Cloud Composer (hosted Airflow) which runs fine in Airflow locally. All it does is print "Hello World". However, when I run it through Cloud Composer I receive the error:
*** Log file does not exist: /home/airflow/gcs/logs/matts_custom_dag/main_test/2020-04-20T23:46:53.652833+00:00/2.log
*** Fetching from: http://airflow-worker-d775d7cdd-tmzj9:8793/log/matts_custom_dag/main_test/2020-04-20T23:46:53.652833+00:00/2.log
*** Failed to fetch log file from worker. HTTPConnectionPool(host='airflow-worker-d775d7cdd-tmzj9', port=8793): Max retries exceeded with url: /log/matts_custom_dag/main_test/2020-04-20T23:46:53.652833+00:00/2.log (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f8825920160>: Failed to establish a new connection: [Errno -2] Name or service not known',))
I've also tried making the DAG add data into a database and it actually succeeds 50% of the time. However, it always returns this error message (and no other print statements or logs). Any help much appreciated on why this might be happening.
We also faced the same issue then raised a support ticket to GCP and got the following reply.
The message is related to the latency of syncing logs from Airflow workers to WebServer, it takes at least some minutes (depending on the number of objects and their size)
The total log size seems not large but it’s enough to noticeably slow down synchronization, hence, we recommend cleanup/archive the logs
Basically we recommend relying on Stackdriver logs instead, because of latency due to the design of this sync
I hope this will help you solve the problem.
I have the same problem after upgrading from 1.10.3 to 1.10.6 of Google Composer.
I can see in my logs that airflow is trying to get the logs from a bucket with a name ended with -tenant while the bucket in my account ends with -bucket
In the configuration, I can see something weird too.
## airflow.cfg
[core]
remote_base_log_folder = gs://us-east1-dada-airflow-xxxxx-bucket/logs
## also in the running configuration says
core remote_base_log_folder gs://us-east1-dada-airflow-xxxxx-tenant/logs env var
I wrote to google support and they said the team is working on a fix.
EDIT:
I've been accessing my logs with gsutil and replacing the bucket name suffix to -bucket
gsutil cat gs://us-east1-dada-airflow-xxxxx-bucket/logs/...../5.logs
I faced the same situation in multiple occasions.
As soon as when the job finished when I take a look at the log on Airflow Web UI, it used to give me the same error. Although when I check back the same logs on UI after a min or 2, I could see the logs properly.
As per the above answers, its a sync issue between the webserver and the Worker node.
In general, the issue describe here should be more like a sporadic issue.
In certain situations, what could help is setting default-task-retries to a value that allows for retrying a task at least 1.
This issue is resolved at least since Airflow version: 1.10.10+composer.

Batch Jobs Not Running When Set to Waiting on My Dev Server

My level of experience with the product is basic at best, but I'm expected to be a developer; I have a basic understanding of many things.
Right now my job is to investigate canceling lines in Purchase Orders. We have a workflow set up to handle those, and I'm trying to duplicate the scenario in my dev instance. Whenever a user cancels a line, the workflow is supposed to engage, and I've found that a batch job is what triggers that workflow to work (maybe that's the case with all workflows, but I don't know that for sure).
I've set up my personal Dev AX Instance under System Configuration => System => Server Configuration to use my personal Dev AOS server that my client is also running under, but when I go to System Configuration => Batch Jobs => Batch Jobs, then find the Batch Job I've been looking for and set the status to Waiting, the Batch Job never runs.
On our Test instance, the jobs is configured exactly the same way, except they use the AOS Server allotted for it.
I did a SQL script to change the batch job to use my personal Dev AOS Server, then did a restart of the Dynamics AX Servers.
There must be something I'm doing wrong for my personal dev instance. I've been reading some things from here about what may be going on and following down the list, but I'm pretty sure the problem is even stupider => https://www.daxrunbase.com/2017/07/02/troubleshooting-batch-jobs-in-ax/
First of all, do you have all 3 workflow jobs set up?
Workflow message processing
Workflow due date processing
Workflow line-item notifications
They can be set up from System administration > Setup > Workflow > Workflow infrastructure configuration.
Secondly, it is OK for the periodic batch jobs to have status Waiting. They will be in status Executing for a short time and then they will be Waiting for the next run. If the Scheduled start date/time value in this batch job is in the past, that could be a problem. Otherwise everything is OK.
Lastly, if you have already ticked the Is batch server check-box in System administration > Setup > System > Server configuration, please also make sure to move the workflow batch group in the Batch server groups section in the same form from Remaining groups to Selected groups.
The batch jobs should start at Scheduled start date/time - or a bit later, you'd need to wait a minute and refresh the grid.

Resources