Google Cloud Composer (Apache Airflow) cannot access log files - airflow

I'm running a DAG in Google Cloud Composer (hosted Airflow) which runs fine in Airflow locally. All it does is print "Hello World". However, when I run it through Cloud Composer I receive the error:
*** Log file does not exist: /home/airflow/gcs/logs/matts_custom_dag/main_test/2020-04-20T23:46:53.652833+00:00/2.log
*** Fetching from: http://airflow-worker-d775d7cdd-tmzj9:8793/log/matts_custom_dag/main_test/2020-04-20T23:46:53.652833+00:00/2.log
*** Failed to fetch log file from worker. HTTPConnectionPool(host='airflow-worker-d775d7cdd-tmzj9', port=8793): Max retries exceeded with url: /log/matts_custom_dag/main_test/2020-04-20T23:46:53.652833+00:00/2.log (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f8825920160>: Failed to establish a new connection: [Errno -2] Name or service not known',))
I've also tried making the DAG add data into a database and it actually succeeds 50% of the time. However, it always returns this error message (and no other print statements or logs). Any help much appreciated on why this might be happening.

We also faced the same issue then raised a support ticket to GCP and got the following reply.
The message is related to the latency of syncing logs from Airflow workers to WebServer, it takes at least some minutes (depending on the number of objects and their size)
The total log size seems not large but it’s enough to noticeably slow down synchronization, hence, we recommend cleanup/archive the logs
Basically we recommend relying on Stackdriver logs instead, because of latency due to the design of this sync
I hope this will help you solve the problem.

I have the same problem after upgrading from 1.10.3 to 1.10.6 of Google Composer.
I can see in my logs that airflow is trying to get the logs from a bucket with a name ended with -tenant while the bucket in my account ends with -bucket
In the configuration, I can see something weird too.
## airflow.cfg
[core]
remote_base_log_folder = gs://us-east1-dada-airflow-xxxxx-bucket/logs
## also in the running configuration says
core remote_base_log_folder gs://us-east1-dada-airflow-xxxxx-tenant/logs env var
I wrote to google support and they said the team is working on a fix.
EDIT:
I've been accessing my logs with gsutil and replacing the bucket name suffix to -bucket
gsutil cat gs://us-east1-dada-airflow-xxxxx-bucket/logs/...../5.logs

I faced the same situation in multiple occasions.
As soon as when the job finished when I take a look at the log on Airflow Web UI, it used to give me the same error. Although when I check back the same logs on UI after a min or 2, I could see the logs properly.
As per the above answers, its a sync issue between the webserver and the Worker node.

In general, the issue describe here should be more like a sporadic issue.
In certain situations, what could help is setting default-task-retries to a value that allows for retrying a task at least 1.

This issue is resolved at least since Airflow version: 1.10.10+composer.

Related

airflow logs: break long logs into smaller multiple files

I see that airflow logs are stored at
base_log_folder/dag_id/task_id/date_time/1.log
i.e:
base_log_folder/dag_id={dag_id}/run_id={run_id}/task_id={task_id}/attempt={try_number}.log
Sometime my logs are huge and know its now a good idea to check them from the web ui, because the chrome cant handle so much size of logs.
I have access to the server and can check the logs.
So how can i break the longs into smaller files - v
i.e
{try_number}_1.log
{try_number}_2.log
{try_number}_3.log
...
Also noted that the log file {trynumber}.log, is only created when the task is completed.
while the task is running i can check the logs in the webui, but i dont see any file in the corresponding log folder.
So i need two things for logging from the server side:
break large log files into smaller files
see the log file live while the task is running, not only after the task is completed
In Airflow 2.4.0 there is an option to view full logs or only the first fragment thus huge logs are not loaded automatically:
Starting Airflow 2.5.0 the web UI also does auto tails for logs (PR)
Airflow does show live logs. If you will set for example a Sensor task that pokes resource you will see the poking attempts in the log when the task is running. It's important to note that there are local logs and remote logs (docs):
In the Airflow UI, remote logs take precedence over local logs when remote logging is enabled. If remote logs can not be found or accessed, local logs will be displayed. Note that logs are only sent to remote storage once a task is complete (including failure). In other words, remote logs for running tasks are unavailable (but local logs are available).
Huge logs are often a sign of not using log levels. If you have entries relevant for debugging then set DEBUG mode rather than INFO mode that way you can better control over the log size displayed in the UI using the AIRFLOW__LOGGING__LOGGING_LEVEL variable.

Airflow ECS-Operator not fetching CloudWatch Logs

I'm using Airflow's EcsOperator, ECS tasks writing to Cloudwatch.
Sometimes Airflow log fetcher collects logs from CloudWatch and sometimes does not.
On the CloudWatch console, I always see the logs.
On tasks that take a long time, I usually see the log or at least part of it.
Someone had the same issue with ECSOperator?
First ECSOperator is deprecated and removed in provider version 5.0.0
You should switch to EcsRunTaskOperator.
In EcsRunTaskOperator there is awslogs_fetch_interval which control over the interval to fetch logs from Ecs. The default is 30 seconds. If you wish for more frequent polls then set the parameter value accordingly.
You didn't mention what provider version you are on but this part of the code was refactored in version 5.0.0 (PR) so upgrading the Amazon provider might also resolve your issue.

How to improve Cloud composer health?

I recently built 120 dags using cloud composer. They all functioned for a while.
They were all approximately the same. Each used python operator. Each made API calls to google search console. Each collected 7-9k rows of GSC data into a pandas dataframe, then uploaded this to GCS buckets and BigQuery (partitioned and clustered).
Occasionally I'd have all fail one day because the GSC auth token had been revoked, but no problem, create new credentials, upload and continue. That situation lasted a couple of months. Now nothing runs.
From the start, the cloud composer health had occasional red spots, but now the health is static red every day.
I have found documentation about how to check the health, but not how to find why the health is so poor and fix it.
Can anyone point me in the right direction?
The environment health metric depends on a Composer-managed DAG named airflow_monitoring which is triggered periodically by the airflow-monitoring pod. If this DAG isn't deleted, you can check the airflow-monitoring logs to see if there are any problems related to reading the DAG's run statuses. Consequently, you can also try troubleshooting the error in Cloud Logging using the filter:
resource.type="cloud_composer_environment"
severity=ERROR
The liveness check failure could be due to the following reasons:
Any resource constraint(Memory and CPU)
Known issue with the composer version. Please check composer
release
notes for any
known issues.
Airflow configuration as core:default_timezone(If you’ve
configured core: default_timezone airflow configuration composer
environment health will be shown as unhealthy. It is a known
issue and the composer product team is working on the resolution.)
Refer to this documentation for information on Cloud Composer’s environment health metric.
I was lucky enough to talk to someone from Google yesterday who said what I need to do is recreate my cloud composer environment because I have insufficient CPU. He suggested the flexible choice when recreating.

Airflow dag cannot find connection-id

I am managing a Google Cloud Composer environment which runs Airflow for a data engineering team. I have recently been asked to troubleshoot one of the dags they run which is failing with this error : [12:41:18,119] {credentials_utils.py:23} WARNING - [redacted-name] connection ID not available, falling back to Google default credentials
The job is basically a data pipeline which reads from various sources and stores data into GBQ. The odd part is that they have a strictly similar Dag running for a different project and it works perfectly.
I have recreated the .json credentials for the service account behind the connection as well as the connection itself in Airflow. I have sanitized the code to see if there was any hidden spaces or so.
My knowledge of Airflow is limited and I have not been able to find any similar issue in my research, any one have encountered this before?
So the DE team came back to me saying it was actually a deployment issue where an internal module involved in service account authentication was being utilized inside another DAG running in stage environment, rendering it impossible to proceed to credential fetch from the connection ID.

Google Cloud Composer The server encountered a temporary error and could not complete your request

After running for a couple of days Google Cloud Composer web UI returns the 502 Server Error indefinitely:
Error: Server Error
The server encountered a temporary error and could not complete your request.
Please try again in 30 seconds.
The only way to fix it is to recreate the Composer environment. Though after running for a couple of days the new environment crashes with the same error.
Image version: composer-1.4.0-airflow-1.10.0
Python version: 3
Anyone knows what's the root cause?
I don't run Cloud Composer but I suspect that there's a case where the webserver has exited from all the web worker threads. This can sometimes happen when airflow has an extended timeout reading or writing to the database; either due to a held lock, or network connection issues. It probably is configured to restart if it fully exits, but there are some cases were the airflow webserver command will still hold on without exiting even though all web workers have exited.
Alternatively the 502 is about the identity provider implemented for GCP. If that's the case you might find you need to sign out of your Google login and use the sign in flow provided by Airflow (if it responds to a private browser session or a signed out session).
I was facing the same 502 error and it turned out to be an issue with the DAG itself. As mentioned:
https://cloud.google.com/composer/docs/how-to/using/troubleshooting-dags
"The web server parses the DAG definition files, and a 502 gateway timeout can occur if there are errors in the DAG."
Visible in Composer / Monitoring
Web server was affected by an issue with the DAG itself. We solved it by deleting the recently added DAGs, after couple of minutes the Airflow UI was up.

Resources