Error: No response from Gunicorn master within 120 seconds . Shutting down webserver. Airflow-Webserver Service won't start - airflow

I want to monitor my airflow worker logs with the help of Prometheus.
So I looked up on the internet and found statsd-exporter can help me.
But, when I added the required configuration of statsd-exporter in Airflow.cfg , Service wont start.
Airflow.cfg file
[scheduler]
statsd_on = True
statsd_host = xxx.xxx.x.xxx
statsd_port = 9125
statsd_prefix = airflow
When I restart The service after adding these lines in Airflow.cfg , it doesnt restart and throws above error in logs.
When I remove these lines , It starts perfectly.
I installed statsd and statsd-exporter via pip.
Also can anyone suggest any better way this can be done.

Related

Airflow:Logs not showing in the UI while tasks running

Airflow Version: 2.2.4
Airflow running in EKS
Issue: Logs not showing in the UI while tasks running
The issue with the logs is that airflow is only writing the logs to the log file rather than standard out as well. This is what's preventing us from being able to see the logs in the web UI while the task is running.
When i get into the pod , i do see log inside the pod
Is there any solution to finding the setting or configuration needed to output to both?
I log as below
kubectl logs detaskdate0.3d55e5ba89ca4ad491bb3e1cadfdaaec -n airflow
Added new context arn:aws:eks:us-west-2:XXXXXXXX:cluster/us-west-2-airflow-cluster to /home/airflow/.kube/config
[2022-05-20 19:56:43,529] {dagbag.py:500} INFO - Filling up the DagBag from /opt/airflow/dags/tss/dq_tss_mod_date_dag.py

Apache Airflow 1.10.10, Remote Worker and S3 logs

I have recently upgraded airflow from 1.10.0 to 1.10.10. The current setup is web, worker, scheduler and flower are on same machine. When a DAG is run first step is it spins up new EMR for the DAG and along with it a worker node where only worker process runs. We are using celery executor. This worker node sends tasks to run on EMR cluster. Once the tasks are run next steps are terminating EMR and terminating this worker instance. Every task's log is present on this worker node. As long as the tasks are running or worker node is running, I can see the logs on web UI. But as soon as worker is terminated, I am unable to see the logs. The config is setup is to upload logs to s3. I see logs of startEMR and startWorker on S3 since these logs are main airflow instance(where all 4 processes are running)
Here is the config snippet of airflow.cfg
base_log_folder = /home/deploy/airflow/logs
remote_logging = True
remote_base_log_folder = s3://airflow-log-bucket/airflow/logs/
remote_log_conn_id = aws_default
encrypt_s3_logs = False
s3_log_folder = '/airflow/logs/'
executor = CeleryExecutor
Same config file is setup when worker instance is initialized for DAG and only worker process is started on that node.
Here is the log from a task when worker node is terminated.
*** Log file does not exist: /home/deploy/airflow/logs/XXXX/XXXXXX/2020-07-07T23:30:05+00:00/1.log
*** Fetching from: http://ip-10-164-62-253.ap-southeast-2.compute.internal:8799/log/XXXX/XXXXXX/2020-07-07T23:30:05+00:00/1.log
*** Failed to fetch log file from worker. HTTPConnectionPool(host='ip-10-164-62-253.ap-southeast-2.compute.internal', port=8799): Max retries exceeded with url: /log/xxxx/XXXXXX/2020-07-07T23:30:05+00:00/1.log (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f6750ac52b0>: Failed to establish a new connection: [Errno 113] No route to host',))
So basically -
This was working in airflow 1.10.1 (I did not need to add remote_logging=True)
The logs are copied to S3 for EMR start and Worker Node start steps and are shown on web-UI.
Only tasks running on remote worker node are not copied to S3.
Can someone please let me know what am I missing in configuration as same config used to work on airflow1.10.0
I found the mistake I was doing. The S3 module that was getting installed on new worker node was being installed via pip and not pip3. Airflow server was having this installation from pip3.
Another config change I had to do was in webserver section of airflow.cfg file.
worker_class = sync
This was previously gevent.

Airflow (LocalExecutor) - Docker :: Job is failing with Log file does not exist

Airflow version: 1.10.9
Executor : LocalExecutor
Docket Setup
when job runs sometime we are getting following error. I have searched in web, many people faced this issue in celeryExecutor but we are using LocalExecutor(Docker setup). How can I resolve this problem?
*** Log file does not exist: /home/ubuntu/airflow/airflow/logs/es_update_relevance_score/es_update_relevance_score/2020-05-14T16:26:06.062416+00:00/1.log
*** Fetching from: http://:8793/log/es_update_relevance_score/es_update_relevance_score/2020-05-14T16:26:06.062416+00:00/1.log
*** Failed to fetch log file from worker. Invalid URL 'http://:8793/log/es_update_relevance_score/es_update_relevance_score/2020-05-14T16:26:06.062416+00:00/1.log': No host supplied
Here is one approach I've seen when running the scheduler and webserver in their own containers and using LocalExecutor:
Mount a host log directory as a volume into both the scheduler and webserver containers:
volumes:
- /location/on/host/airflow/logs:/opt/airflow/logs
Make sure the user within the airflow containers (usually airflow) has permissions to read and write that directory. If the permissions are wrong you will see an error like the one in your post.
This probably won't scale beyond LocalExecutor usage though.

Airflow - Failed to fetch log file from worker. 404 Client Error: NOT FOUND for url

I am running Airflowv1.9 with Celery Executor. I have 5 Airflow workers running in 5 different machines. Airflow scheduler is also running in one of these machines. I have copied the same airflow.cfg file across these 5 machines.
I have daily workflows setup in different queues like DEV, QA etc. (each worker runs with an individual queue name) which are running fine.
While scheduling a DAG in one of the worker (no other DAG have been setup for this worker/machine previously), I am seeing the error in the 1st task and as a result downstream tasks are failing:
*** Log file isn't local.
*** Fetching here: http://<worker hostname>:8793/log/PDI_Incr_20190407_v2/checkBCWatermarkDt/2019-04-07T17:00:00/1.log
*** Failed to fetch log file from worker. 404 Client Error: NOT FOUND for url: http://<worker hostname>:8793/log/PDI_Incr_20190407_v2/checkBCWatermarkDt/2019-04-07T17:00:00/1.log
I have configured MySQL for storing the DAG metadata. When I checked task_instance table, I see proper hostnames are populated against the task.
I also checked the log location and found that the log is getting created.
airflow.cfg snippet:
base_log_folder = /var/log/airflow
base_url = http://<webserver ip>:8082
worker_log_server_port = 8793
api_client = airflow.api.client.local_client
endpoint_url = http://localhost:8080
What am I missing here? What configurations do I need to check additionally for resolving this issue?
Looks like the worker's hostname is not being correctly resolved.
Add a file hostname_resolver.py:
import os
import socket
import requests
def resolve():
"""
Resolves Airflow external hostname for accessing logs on a worker
"""
if 'AWS_REGION' in os.environ:
# Return EC2 instance hostname:
return requests.get(
'http://169.254.169.254/latest/meta-data/local-ipv4').text
# Use DNS request for finding out what's our external IP:
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.connect(('1.1.1.1', 53))
external_ip = s.getsockname()[0]
s.close()
return external_ip
And export: AIRFLOW__CORE__HOSTNAME_CALLABLE=airflow.hostname_resolver:resolve
The web program of the master needs to go to the worker to fetch the log and display it on the front-end page. This process is to find the host name of the worker. Obviously, the host name cannot be found,Therefore, add the host name to IP mapping on the master's vim /etc/hosts
If this happens as part of a Docker Compose Airflow setup, the hostname resolution needs to be passed to the container hosting the webserver, e.g. through extra_hosts:
# docker-compose.yml
version: "3.9"
services:
webserver:
extra_hosts:
- "worker_hostname_0:192.168.xxx.yyy"
- "worker_hostname_1:192.168.xxx.zzz"
...
...
More details here.

Airflow: New DAG is not found by webserver

In Airflow, how should I handle the error "This DAG isn't available in the webserver DagBag object. It shows up in this list because the scheduler marked it as active in the metadata database"?
I've copied a new DAG to an Airflow server, and have tried:
unpausing it and refreshing it (basic operating procedure, given in this previous answer https://stackoverflow.com/a/42291683/160406)
restarting the webserver
restarting the scheduler
stopping the webserver and scheduler, resetting the database (airflow resetdb), then starting the webserver and scheduler again
running airflow backfill (suggested here Airflow "This DAG isnt available in the webserver DagBag object ")
running airflow trigger_dag
The scheduler log shows it being processed and no errors occurring, I can interact with it and view it's state through the CLI, but it still does not appear in the web UI.
Edit: the webserver and scheduler are running on the same machine with the same airflow.cfg. They're not running in Docker.
They're run by Supervisor, which runs them both as the same user (airflow). The airflow user has read, write and execute permission on all of the dag files.
This helped me...
pkill -9 -f "airflow scheduler"
pkill -9 -f "airflow webserver"
pkill -9 -f "gunicorn"
then restart the airflow scheduler and webserver.
Just had this issue myself. After changing permissions, resetting the meta database, restarting the webserver & even making some potential code changes to rectify the situation, it didn't happen.
However, I noticed that even though we were stopping the webserver, our gunicorn process was still running. Killing these processes & then starting everything back up resulted in success
I had the same problem on an airflow installed from a Docker image
What I did was:
1- delete all files .pyc
2- delete Metadata databse using :
for t in ["xcom", "task_instance", "sla_miss", "log", "job", "dag_run", "dag" ]:
sql="delete from {} where dag_id='{}'".format(t, dag_input)
hook.run(sql, True)
3- restart webserver & scheduler
4- Execute airflow updatedb
It resolved the problem for me.
if the airflow_home - dags_folder config parameter is same for scheduler, webUI and the command line interface the only cause for the error:
This DAG isn't available in the webserver DagBag object
can be file permissions or error in python script.
Please check
Run the dag as normal python script and check for errors
User in airflow.cfg and the one creating the dag should be same or the dag should have execute permission for the airflow user
With Airflow 1.9 I don't experience the problem with zombie gunicorn processes.
I do a simple restart: systemctl restart airflow-webserver and it forces webserver to refresh DAG status.

Resources