I have a few DAGs running in the cluster. A couple of them succeeded after 4 attempts, but neither of them shows the logs from the previous attempts.
Failed to fetch log file from worker. 404 Client Error: NOT FOUND for url: http://airflow-worker:8793/log/s3-copy/2022-04-16T11:30:00+00:00/1.log
Any ideas why the Airflow UI is not displaying the logs? I also checked the worker's log directory; it doesn't contain the previous attempts' logs.
I see that Airflow logs are stored at
base_log_folder/dag_id/task_id/date_time/1.log
i.e.:
base_log_folder/dag_id={dag_id}/run_id={run_id}/task_id={task_id}/attempt={try_number}.log
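For reference, that newer layout seems to come from the log filename template in the logging settings; a sketch of the relevant airflow.cfg entries as I understand them (the path and the exact template may differ by version):

    [logging]
    # base_log_folder is wherever your install keeps task logs (illustrative path)
    base_log_folder = /opt/airflow/logs
    # the per-attempt file layout shown above
    log_filename_template = dag_id={{ ti.dag_id }}/run_id={{ ti.run_id }}/task_id={{ ti.task_id }}/attempt={{ try_number }}.log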
Sometimes my logs are huge, and I know it's not a good idea to check them from the web UI because Chrome can't handle that much log content.
I have access to the server and can check the logs.
So how can I break the logs into smaller files (there's a generic sketch of this kind of rotation after the list below)?
i.e.:
{try_number}_1.log
{try_number}_2.log
{try_number}_3.log
...
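For what it's worth, the kind of splitting I have in mind is what Python's standard RotatingFileHandler does; a generic sketch (not Airflow's own task handler, and the path, size limit, and logger name are made up):

    import logging
    from logging.handlers import RotatingFileHandler

    # Roll over to a new file every ~50 MB, keeping up to 10 chunks.
    # Note: the stdlib names the chunks attempt.log, attempt.log.1, attempt.log.2, ...
    # rather than {try_number}_1.log, but the idea is the same.
    handler = RotatingFileHandler(
        "/tmp/attempt.log",              # hypothetical path
        maxBytes=50 * 1024 * 1024,
        backupCount=10,
    )
    logger = logging.getLogger("my_task_logger")
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.info("this line goes to the current chunk")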
I also noted that the log file {try_number}.log is only created when the task has completed.
While the task is running I can check the logs in the web UI, but I don't see any file in the corresponding log folder.
So I need two things for logging on the server side:
break large log files into smaller files
see the log file live while the task is running, not only after the task is completed
In Airflow 2.4.0 there is an option to view the full log or only the first fragment, so huge logs are not loaded automatically:
Starting with Airflow 2.5.0, the web UI also auto-tails logs (PR).
Airflow does show live logs. If you set up, for example, a Sensor task that pokes a resource, you will see the poking attempts in the log while the task is running. It's important to note that there are local logs and remote logs (docs):
In the Airflow UI, remote logs take precedence over local logs when remote logging is enabled. If remote logs can not be found or accessed, local logs will be displayed. Note that logs are only sent to remote storage once a task is complete (including failure). In other words, remote logs for running tasks are unavailable (but local logs are available).
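To make the sensor example concrete, a minimal sketch whose poke messages show up live in the UI while the task runs (the class name, dag_id, and intervals are made up; BaseSensorOperator and self.log are standard in Airflow 2.x):

    from datetime import datetime
    from airflow import DAG
    from airflow.sensors.base import BaseSensorOperator

    class PokeSomethingSensor(BaseSensorOperator):
        """Illustrative sensor: each poke writes a log line you can watch live."""
        def poke(self, context):
            self.log.info("Poking the resource...")
            return False  # a real sensor would return True once the condition is met

    with DAG(
        dag_id="live_log_demo",          # hypothetical
        start_date=datetime(2022, 1, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        PokeSomethingSensor(
            task_id="poke_resource",
            poke_interval=30,            # a new log line roughly every 30 seconds
            timeout=600,
        )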
Huge logs are often a sign of not using log levels. If you have entries that are only relevant for debugging, log them at DEBUG rather than INFO; that way you have better control over the log size displayed in the UI via the AIRFLOW__LOGGING__LOGGING_LEVEL variable.
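As a small illustration, a python_callable of the kind I mean (the function and logger are hypothetical):

    import logging

    log = logging.getLogger(__name__)

    def transform(rows):
        """Meant to be used as the python_callable of a PythonOperator."""
        log.info("Processing %d rows", len(rows))   # visible at the default INFO level
        for row in rows:
            log.debug("Row payload: %s", row)       # emitted only when the level is DEBUG
        return rows

With AIRFLOW__LOGGING__LOGGING_LEVEL left at INFO, only the summary line ends up in the UI; switch to DEBUG just for the runs where you actually need the per-row detail.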
I have the directory for my DAG in the airflow/dags directory, and when calling airflow dags list while logged into the webserver, the DAG's ID shows up in the list. However, calling airflow dags list while logged into the scheduler returns the following error:
Killed
command terminated with exit code 137
The DAG also does not show up in the list on the webserver UI. When I manually enter the dag_id in the URL, the DAG appears with every task in the right place, but triggering a manual run via the Trigger DAG button results in a pop-up stating Cannot find dag <dag_id>. Has anyone run into this issue before? Is this a memory problem?
My DAG code is written in Python, and the resulting DAG object has a large number of tasks (>80).
Running on Airflow 1.10.15 with the Kubernetes executor.
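For illustration, a simplified sketch of a DAG with that many tasks (hypothetical; not the actual code, and whether the tasks are generated in a loop is just for the example):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator  # Airflow 1.10.x import path

    def do_step(step_number, **kwargs):
        print("running step", step_number)

    dag = DAG(
        dag_id="large_dag",                      # hypothetical
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
    )

    previous = None
    for i in range(85):                          # >80 tasks
        step = PythonOperator(
            task_id="step_{}".format(i),
            python_callable=do_step,
            op_kwargs={"step_number": i},
            dag=dag,
        )
        if previous is not None:
            previous >> step
        previous = step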
I am using Google Cloud Composer for Apache Airflow. I am trying to mark one of the failed jobs as success, but it gives an "Exception: Ooops" error.
When I try from the UI DAG page, it does not list any job instance that can be marked as success. But when I look under Browse >> Task Instances, I can see the failed tasks.
No tasks are listed when trying to mark the failed job as success
And when I go to Browse >> Task Instances, select the failed instances, and try With Selected >> Set state to success, it gives the Oops exception.
How do I mark the failed job as success in Airflow, so that the downstream tasks can be triggered?
I am trying to place multiple instances of the same DAG in the Google Cloud Composer dags folder and run them in Airflow. What happens is that, apart from the first instance, all other instances fail and do not produce any logs. These instances do not even enter the running state; they fail instantly after being queued.
And since there are no logs, I can't debug anything.
After running for a couple of days, the Google Cloud Composer web UI returns a 502 Server Error indefinitely:
Error: Server Error
The server encountered a temporary error and could not complete your request.
Please try again in 30 seconds.
The only way to fix it is to recreate the Composer environment, though after running for a couple of days the new environment crashes with the same error.
Image version: composer-1.4.0-airflow-1.10.0
Python version: 3
Does anyone know what the root cause is?
I don't run Cloud Composer, but I suspect there's a case where the webserver has exited from all of its web worker threads. This can sometimes happen when Airflow hits an extended timeout reading from or writing to the database, either due to a held lock or to network connection issues. It is probably configured to restart if it fully exits, but there are some cases where the airflow webserver command will hold on without exiting even though all of its web workers have exited.
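For reference, the knobs that govern that web worker lifecycle live in the [webserver] section of airflow.cfg (a sketch; the option names exist, but the values here are only illustrative, so check your version's defaults):

    [webserver]
    # how long the gunicorn master will wait on its workers before giving up
    web_server_master_timeout = 120
    web_server_worker_timeout = 120
    # web workers are recycled periodically in small batches
    worker_refresh_batch_size = 1
    worker_refresh_interval = 30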
Alternatively, the 502 may be coming from the identity provider implemented for GCP. If that's the case, you might find you need to sign out of your Google login and use the sign-in flow provided by Airflow (check whether it responds to a private browser session or a signed-out session).
I was facing the same 502 error, and it turned out to be an issue with the DAG itself. As mentioned at:
https://cloud.google.com/composer/docs/how-to/using/troubleshooting-dags
"The web server parses the DAG definition files, and a 502 gateway timeout can occur if there are errors in the DAG."
Visible in Composer / Monitoring.
The web server was affected by an issue with the DAG itself. We solved it by deleting the recently added DAGs; after a couple of minutes the Airflow UI was back up.