Airflow logs: break long logs into multiple smaller files

I see that Airflow logs are stored at
base_log_folder/dag_id/task_id/date_time/1.log
i.e.:
base_log_folder/dag_id={dag_id}/run_id={run_id}/task_id={task_id}/attempt={try_number}.log
Sometimes my logs are huge, and I know it's not a good idea to check them from the web UI, because Chrome can't handle that much log data.
I have access to the server and can check the logs.
So how can I break the logs into smaller files?
i.e.
{try_number}_1.log
{try_number}_2.log
{try_number}_3.log
...
I also noticed that the log file {try_number}.log is only created when the task is completed.
While the task is running I can check the logs in the web UI, but I don't see any file in the corresponding log folder.
So I need two things for logging from the server side:
break large log files into smaller files (see the sketch below)
see the log file live while the task is running, not only after the task is completed
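For reference, a post-processing sketch like the one below is what I have in mind for the splitting part; it is just a minimal server-side script (not an Airflow feature), and the chunk size and paths are placeholders:

# split_task_log.py -- minimal sketch: split a finished attempt log into
# numbered chunks like attempt=1_1.log, attempt=1_2.log, ...
from pathlib import Path
import sys

def split_log(log_path, lines_per_chunk=50000):
    src = Path(log_path)                 # e.g. .../attempt=1.log
    stem = src.stem                      # "attempt=1" -> "attempt=1_N.log"
    chunk, n = [], 1
    with src.open() as f:
        for line in f:
            chunk.append(line)
            if len(chunk) >= lines_per_chunk:
                (src.parent / f"{stem}_{n}.log").write_text("".join(chunk))
                chunk, n = [], n + 1
    if chunk:                            # write the final partial chunk
        (src.parent / f"{stem}_{n}.log").write_text("".join(chunk))

if __name__ == "__main__":
    split_log(sys.argv[1])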

In Airflow 2.4.0 there is an option to view the full logs or only the first fragment, so huge logs are not loaded automatically.
Starting with Airflow 2.5.0 the web UI also auto-tails logs (PR).
Airflow does show live logs. If you set up, for example, a Sensor task that pokes a resource, you will see the poking attempts in the log while the task is running. It's important to note that there are local logs and remote logs (docs):
In the Airflow UI, remote logs take precedence over local logs when remote logging is enabled. If remote logs can not be found or accessed, local logs will be displayed. Note that logs are only sent to remote storage once a task is complete (including failure). In other words, remote logs for running tasks are unavailable (but local logs are available).
Huge logs are often a sign of not using log levels. If you have entries that are only relevant for debugging, emit them at DEBUG level rather than INFO; that way you have better control over the log size displayed in the UI via the AIRFLOW__LOGGING__LOGGING_LEVEL variable.
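For example, a hypothetical task function like this one only emits its per-record lines when AIRFLOW__LOGGING__LOGGING_LEVEL=DEBUG; at the default INFO level they are filtered out:

import logging

log = logging.getLogger(__name__)

def process(records):
    # always visible at the default INFO level
    log.info("processing %d records", len(records))
    for r in records:
        # hidden unless the logging level is DEBUG
        log.debug("record payload: %r", r)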

Related

Google Cloud Composer (Apache Airflow) cannot access log files

I'm running a DAG in Google Cloud Composer (hosted Airflow) which runs fine in Airflow locally. All it does is print "Hello World". However, when I run it through Cloud Composer I receive the error:
*** Log file does not exist: /home/airflow/gcs/logs/matts_custom_dag/main_test/2020-04-20T23:46:53.652833+00:00/2.log
*** Fetching from: http://airflow-worker-d775d7cdd-tmzj9:8793/log/matts_custom_dag/main_test/2020-04-20T23:46:53.652833+00:00/2.log
*** Failed to fetch log file from worker. HTTPConnectionPool(host='airflow-worker-d775d7cdd-tmzj9', port=8793): Max retries exceeded with url: /log/matts_custom_dag/main_test/2020-04-20T23:46:53.652833+00:00/2.log (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f8825920160>: Failed to establish a new connection: [Errno -2] Name or service not known',))
I've also tried making the DAG add data into a database, and it actually succeeds 50% of the time. However, it always returns this error message (and no other print statements or logs). Any help on why this might be happening would be much appreciated.
We also faced the same issue, then raised a support ticket with GCP and got the following reply:
The message is related to the latency of syncing logs from Airflow workers to WebServer, it takes at least some minutes (depending on the number of objects and their size)
The total log size seems not large but it’s enough to noticeably slow down synchronization, hence, we recommend cleanup/archive the logs
Basically we recommend relying on Stackdriver logs instead, because of latency due to the design of this sync
I hope this will help you solve the problem.
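If you do go the Stackdriver route, a sketch along these lines can pull a task's logs with the Cloud Logging client library (the project name and filter values are placeholders for your environment):

from google.cloud import logging as gcl

client = gcl.Client(project="my-gcp-project")      # placeholder project
log_filter = (
    'resource.type="cloud_composer_environment" '
    'AND textPayload:"matts_custom_dag"'           # DAG to look for
)
for entry in client.list_entries(filter_=log_filter, page_size=100):
    print(entry.timestamp, entry.payload)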
I had the same problem after upgrading Google Composer from 1.10.3 to 1.10.6.
I can see in my logs that Airflow is trying to get the logs from a bucket whose name ends with -tenant, while the bucket in my account ends with -bucket.
In the configuration, I can see something weird too.
## airflow.cfg
[core]
remote_base_log_folder = gs://us-east1-dada-airflow-xxxxx-bucket/logs
## but the running configuration says
core remote_base_log_folder gs://us-east1-dada-airflow-xxxxx-tenant/logs env var
I wrote to google support and they said the team is working on a fix.
EDIT:
I've been accessing my logs with gsutil, replacing the bucket name suffix with -bucket:
gsutil cat gs://us-east1-dada-airflow-xxxxx-bucket/logs/...../5.logs
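The equivalent with the google-cloud-storage client library looks roughly like this; the object path is a placeholder, and the bucket is the -bucket variant from above:

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("us-east1-dada-airflow-xxxxx-bucket")
# placeholder object path -- fill in your DAG/task/date
blob = bucket.blob("logs/my_dag/my_task/2020-04-20T23:46:53+00:00/5.log")
print(blob.download_as_text())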
I faced the same situation on multiple occasions.
As soon as a job finished, looking at its log in the Airflow web UI would give me the same error, although when I checked the same log in the UI after a minute or two, I could see it properly.
As per the above answers, it's a sync issue between the web server and the worker node.
In general, the issue described here is sporadic.
In certain situations, what could help is setting the default task retries to a value that allows for retrying a task at least once (a minimal example follows below).
This issue is resolved at least since Airflow version 1.10.10+composer.
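A minimal sketch of the retry suggestion (DAG and task names are placeholders; the imports assume Airflow 2-style paths):

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "retries": 1,                        # retry each task at least once
    "retry_delay": timedelta(minutes=2),
}

with DAG(
    dag_id="example_with_retries",       # placeholder name
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    default_args=default_args,
) as dag:
    BashOperator(task_id="hello", bash_command="echo hello")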

Google Cloud Composer The server encountered a temporary error and could not complete your request

After running for a couple of days Google Cloud Composer web UI returns the 502 Server Error indefinitely:
Error: Server Error
The server encountered a temporary error and could not complete your request.
Please try again in 30 seconds.
The only way to fix it is to recreate the Composer environment, though after running for a couple of days the new environment crashes with the same error.
Image version: composer-1.4.0-airflow-1.10.0
Python version: 3
Does anyone know the root cause?
I don't run Cloud Composer, but I suspect there's a case where the webserver has exited from all the web worker threads. This can sometimes happen when Airflow has an extended timeout reading from or writing to the database, either due to a held lock or network connection issues. It is probably configured to restart if it fully exits, but there are some cases where the airflow webserver command will still hold on without exiting even though all web workers have exited.
Alternatively, the 502 may be about the identity provider implemented for GCP. If that's the case, you might find you need to sign out of your Google login and use the sign-in flow provided by Airflow (check whether it responds in a private browser session or a signed-out session).
I was facing the same 502 error and it turned out to be an issue with the DAG itself. As mentioned:
https://cloud.google.com/composer/docs/how-to/using/troubleshooting-dags
"The web server parses the DAG definition files, and a 502 gateway timeout can occur if there are errors in the DAG."
Visible in Composer / Monitoring
The web server was affected by an issue with the DAG itself. We solved it by deleting the recently added DAGs; after a couple of minutes the Airflow UI was back up.
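A quick way to surface such parse-time DAG errors before they reach the web server is to load the file the same way the parser does; this is a common Airflow testing pattern, and the path below is a placeholder:

import importlib.util

# import the DAG file as a module; any top-level error raises immediately
spec = importlib.util.spec_from_file_location("my_dag", "dags/my_dag.py")
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
print("DAG file parsed OK")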

Batch Jobs Not Running When Set to Waiting on My Dev Server

My level of experience with the product is basic at best, but I'm expected to be a developer; I have a basic understanding of many things.
Right now my job is to investigate canceling lines in Purchase Orders. We have a workflow set up to handle those, and I'm trying to duplicate the scenario in my dev instance. Whenever a user cancels a line, the workflow is supposed to engage, and I've found that a batch job is what triggers that workflow to work (maybe that's the case with all workflows, but I don't know that for sure).
I've set up my personal Dev AX instance under System Configuration => System => Server Configuration to use my personal Dev AOS server that my client is also running under, but when I go to System Configuration => Batch Jobs => Batch Jobs, find the batch job I've been looking for, and set its status to Waiting, the batch job never runs.
On our Test instance, the job is configured exactly the same way, except it uses the AOS server allotted for it.
I ran a SQL script to change the batch job to use my personal Dev AOS server, then restarted the Dynamics AX servers.
There must be something I'm doing wrong in my personal dev instance. I've been reading about what may be going on and working down the list here, but I'm pretty sure the problem is even stupider => https://www.daxrunbase.com/2017/07/02/troubleshooting-batch-jobs-in-ax/
First of all, do you have all 3 workflow jobs set up?
Workflow message processing
Workflow due date processing
Workflow line-item notifications
They can be set up from System administration > Setup > Workflow > Workflow infrastructure configuration.
Secondly, it is OK for periodic batch jobs to have the status Waiting. They will be in status Executing for a short time and then go back to Waiting until the next run. If the Scheduled start date/time value of the batch job is in the past, that could be a problem; otherwise everything is OK.
Lastly, if you have already ticked the Is batch server check-box in System administration > Setup > System > Server configuration, please also make sure to move the workflow batch group in the Batch server groups section in the same form from Remaining groups to Selected groups.
The batch jobs should start at the Scheduled start date/time, or a bit later; you'd need to wait a minute and refresh the grid.

Control-M batch job is spawning multiple instances of a singleton ActiveX EXE server

As part of a batch job I create 4 command lines through Control-M which invoke a legacy console application written in VB6. The console application invokes an ActiveX EXE server which performs a set of analytic jobs calculating outputs. The ActiveX server was coded as a singleton, but when invoked through Control-M I get 4 instances running, and the ActiveX server does not tear down once the job has completed and the command line has closed itself.
I created 4 .bat files which, once launched manually on the server, simulate the calls made through Control-M, and the ActiveX server behaves as expected, i.e. there is only 1 instance ever running and once complete it closes down gracefully.
What am I doing wrong?
Control-M jobs run under a service account, the same as if we logged in as a user and executed the job. How did you test this? Did you manually execute each batch job one after another, or did you execute all the batch jobs at the same time from different terminals? One thing you can do: run the Control-M jobs with a time interval, e.g. the first at 09:00, the second at 09:05, the third at 09:10 and the fourth at 09:15, and see if that fixes your issue.
Maybe your job cannot use the Desktop environment.
Check your agent service settings:
Log on As:
User account under which Control‑M Agent service will run.
Valid values:
Local System Account – Service logs on as the system account.
Allow Service to Interact with Desktop – This option is valid only if the service is running as a local system account.
Selected – the service provides a user interface on a desktop that can be used by whoever is logged in when the service is started. Default.
Unselected – the service does not provide a user interface.
This Account – User account under which Control‑M Agent service will run.
NOTE: If the owner of any Control-M/Server jobs has a "roaming profile" or if job output (OUTPUT) will be copied to or from other computers, the Log in mode must be set to This Account.
Default: Local System Account

SSRS Scheduled Reports - No HTTP Log entries

I'm looking to get some stats on parameter retrieval from the HTTP logs in production. We already get them when we execute the reports normally via the Report Manager or via a URL. But when running a report via a schedule (we have them set up in production for warm-ups), nothing is logged....
We also tried a scheduled report in our test environment and got no HTTP entries.. :(
Scheduled reports run as SQL Server Agent jobs, so I wouldn't be at all surprised if they don't hit ReportServer over HTTP, hence no HTTP logs.
Have you looked into querying the execution log? It's got quite detailed information, and will contain entries for every report execution.
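For example, a sketch using pyodbc (ExecutionLog3 is the standard SSRS execution-log view; the connection details are placeholders) pulls recent subscription-driven executions along with their parameters:

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=my-ssrs-host;DATABASE=ReportServer;Trusted_Connection=yes;"
)
cur = conn.cursor()
cur.execute("""
    SELECT TOP 50 ItemPath, RequestType, TimeStart, Status, Parameters
    FROM ExecutionLog3
    WHERE RequestType = 'Subscription'   -- scheduled/subscription executions
    ORDER BY TimeStart DESC
""")
for row in cur.fetchall():
    print(row.ItemPath, row.TimeStart, row.Status, row.Parameters)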
