I need some advice on how to restart all airflow services on deploy without killing the workers in the middle of a task.
I've written a deployment procedure for my DAGs which installs airflow and any other pip dependencies in a virtualenv. Once my release directory is ready, I:
Stop airflow-flower, airflow-worker, airflow-scheduler, and airflow-webserver
Update the "current" symlink to point to my new release
Start airflow-flower, airflow-worker, airflow-scheduler, and airflow-webserver
The problem with this deploy procedure is that the workers get killed immediately. I'd like to add some monitoring to the script to pause all DAGs, wait for the workers to go idle, and then restart the services, but the airflow CLI has no way to tell which DAGs are enabled or whether the workers are idle.
I understand that many of the airflow services can auto-detect changes in the dags folder, but I want each deployment to have its own virtualenv. If I don't restart all services then a new deployment won't pick up a new line in my requirements.txt file.
You have access to the Airflow metadata DB, so consider developing a deployment script that does this process for you (a sketch follows the list below):
Update the DAG table to pause all DAGs
Read the TASK_INSTANCE table to wait until all RUNNING state tasks complete
Restart Airflow services.
Update the DAG table to unpause DAGs.
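A minimal sketch of such a script, assuming it runs from the currently deployed virtualenv so it can use Airflow's ORM against the metadata DB; the service names and the systemctl restart are assumptions matching the setup above:

import subprocess
import time

from airflow import settings
from airflow.models import DagModel, TaskInstance
from airflow.utils.state import State

SERVICES = ["airflow-flower", "airflow-worker", "airflow-scheduler", "airflow-webserver"]

session = settings.Session()

# 1. Pause every DAG that is currently unpaused, remembering which ones we touched
unpaused = session.query(DagModel).filter(DagModel.is_paused == False).all()
for dag in unpaused:
    dag.is_paused = True
session.commit()

# 2. Wait until no task instance is left in the RUNNING state
while session.query(TaskInstance).filter(TaskInstance.state == State.RUNNING).count():
    time.sleep(30)

# 3. Restart the services (the "current" symlink should already point at the new release)
for svc in SERVICES:
    subprocess.check_call(["sudo", "systemctl", "restart", svc])

# 4. Unpause only the DAGs we paused ourselves
for dag in unpaused:
    dag.is_paused = False
session.commit()
session.close()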
Airflow workers quit gracefully on SIGINT. Update your process monitor to stop them with SIGINT instead of the default signal. If you're using systemd, the unit file will look something like this:
...
[Service]
EnvironmentFile=/etc/sysconfig/airflow
User=airflow
Group=airflow
Type=simple
ExecStart=...
KillSignal=SIGINT
Restart=on-failure
RestartSec=10s
...
Related
Hi, I'm currently running Airflow on a Dataproc cluster. My DAGs used to run fine, but now tasks are ending up in the 'retry' state without any logs when I click on task instance -> logs in the Airflow UI.
I see the following error in the terminal where I started the airflow webserver:
2022-06-24 07:30:36.544 [ERROR] Executor reports task instance
<TaskInstance: **task name** 2022-06-23 07:00:00+00:00 [queued]> finished (failed)
although the task says its queued. Was the task killed externally?
None
[2022-06-23 06:08:33,202] {models.py:1758} INFO - Marking task as UP_FOR_RETRY
2022-06-23 06:08:33.202 [INFO] Marking task as UP_FOR_RETRY
What I tried so far
restarted the webserver
started the webserver on 3 different ports
re-ran the backfill command with 3 different timestamps
deleted the dag runs for my dag, created a new dag run, and then re-ran the backfill command
cleared the PID as described in "How do I restart airflow webserver?" and restarted the webserver
None of these worked. The issue has been persistent for the past two days; I'd appreciate any help here. At this point I'm guessing it has to do with a shared DB, but I'm not sure how to fix it.
<<update>> What I also found is that these tasks eventually go to a success or failure state. When that happens the logs are available, but there are still no logs for the retry attempts in $airflow_home or our remote directory.
The issue was that there was another celery worker listening on the same queue. Since this second worker was not configured properly, it was failing the task and not writing the logs to the remote location.
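To spot a situation like that, Celery's inspect API can list which queues each worker is consuming. A small sketch, assuming the Celery app exposed by Airflow's CeleryExecutor (the import path below matches Airflow 2.x and may differ in other versions):

# List the queues each Celery worker is consuming, to spot an unexpected worker
# on the same queue.
from airflow.executors.celery_executor import app

for worker, queues in (app.control.inspect().active_queues() or {}).items():
    print(worker, [q["name"] for q in queues])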
Airflow Version: 2.2.4
Airflow running in EKS
Issue: Logs are not showing in the UI while tasks are running
The issue with the logs is that airflow is only writing the logs to the log file rather than standard out as well. This is what's preventing us from being able to see the logs in the web UI while the task is running.
When I exec into the pod, I do see the log inside the pod.
Is there a setting or configuration that makes Airflow write to both?
I check the logs as below:
kubectl logs detaskdate0.3d55e5ba89ca4ad491bb3e1cadfdaaec -n airflow
Added new context arn:aws:eks:us-west-2:XXXXXXXX:cluster/us-west-2-airflow-cluster to /home/airflow/.kube/config
[2022-05-20 19:56:43,529] {dagbag.py:500} INFO - Filling up the DagBag from /opt/airflow/dags/tss/dq_tss_mod_date_dag.py
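One hedged approach to getting output in both places is a custom logging config: copy Airflow's default config, attach an extra console handler to the task logger, and point logging_config_class (AIRFLOW__LOGGING__LOGGING_CONFIG_CLASS) at the resulting module. The module name log_config below is a placeholder:

# log_config.py -- hypothetical module referenced by logging_config_class
from copy import deepcopy

from airflow.config_templates.airflow_local_settings import DEFAULT_LOGGING_CONFIG

LOGGING_CONFIG = deepcopy(DEFAULT_LOGGING_CONFIG)

# Add a plain stream handler that writes task logs to stdout in addition to the
# log file, so kubectl logs shows them as well.
LOGGING_CONFIG["handlers"]["task_console"] = {
    "class": "logging.StreamHandler",
    "formatter": "airflow",
    "stream": "ext://sys.stdout",
}
LOGGING_CONFIG["loggers"]["airflow.task"]["handlers"].append("task_console")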
I upgraded Airflow from version 1.7.3 to 1.10.1. After upgrading the scheduler, webserver, and workers, the DAGs have stopped working, showing the error below on the scheduler:
Either the dag did not exist or it failed to parse.
I have not made any changes to the config. While investigating, the scheduler logs revealed the issue. Earlier the scheduler queued the task as:
Adding to queue: airflow run <dag_id> <task_id> <execution_date> --local -sd DAGS_FOLDER/<dag_filename.py>
Now it is queued with an absolute path:
Adding to queue: airflow run <dag_id> <task_id> <execution_date> --local -sd /<PATH_TO_DAGS_FOLDER>/<dag_filename.py>
PATH_TO_DAGS_FOLDER is like /home/<user>/Airflow/dags...
which is the same path it pushes to the workers; but since the worker runs as a different user, it cannot find the DAG at that location.
How do I tell the worker to look in its own airflow home directory and not the scheduler's?
I am using MySQL as the backend and RabbitMQ for message passing.
In my first foray into airflow, I am trying to run one of the example DAGs that comes with the installation. This is v1.8.0. Here are my steps:
$ airflow trigger_dag example_bash_operator
[2017-04-19 15:32:38,391] {__init__.py:57} INFO - Using executor SequentialExecutor
[2017-04-19 15:32:38,676] {models.py:167} INFO - Filling up the DagBag from /Users/gbenison/software/kludge/airflow/dags
[2017-04-19 15:32:38,947] {cli.py:185} INFO - Created <DagRun example_bash_operator # 2017-04-19 15:32:38: manual__2017-04-19T15:32:38, externally triggered: True>
$ airflow dag_state example_bash_operator '2017-04-19 15:32:38'
[2017-04-19 15:33:12,918] {__init__.py:57} INFO - Using executor SequentialExecutor
[2017-04-19 15:33:13,229] {models.py:167} INFO - Filling up the DagBag from /Users/gbenison/software/kludge/airflow/dags
running
The dag state remains "running" for a long time (at least 20 minutes by now), although from a quick inspection of this task it should take a matter of seconds. How can I troubleshoot this? How can I see which step it is stuck on?
To run any DAGs, you need to make sure two processes are running:
airflow webserver
airflow scheduler
If you only have airflow webserver running, the UI will show DAGs as running, but if you click on a DAG, none of its tasks are actually running or scheduled; they are in a null state.
What this means is that they are waiting to be picked up by the airflow scheduler. If the scheduler is not running, you'll be stuck in this state forever, as the tasks are never picked up for execution.
Additionally, make sure that the toggle button in the DAGs view is switched to 'ON' for the particular DAG. Otherwise it will not get picked up by the scheduler if you trigger it manually.
I too recently started using Airflow, and my DAGs kept running endlessly. Your DAG may be paused without you realizing it; in that case the scheduler will not schedule new task instances, so when you trigger the DAG it just looks like it runs forever.
There are a few solutions:
1) In the Airflow UI, toggle the button to the left of the DAG from 'Off' to 'On'. Off means the DAG is paused, so On will allow the scheduler to pick it up and complete the DAG. (This fixed my initial issue.)
2) In your airflow.cfg file, dags_are_paused_at_creation = True is the default, so all new DAGs you create are paused from the start. Change this to False and future DAGs will be good to go right away. (I had to restart the webserver and scheduler for changes to airflow.cfg to be recognized; the corresponding config snippet is shown below.)
3) Use the command line: $ airflow unpause [dag_id]
documentation: https://airflow.apache.org/cli.html#unpause
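For reference, option 2 corresponds to this setting in the [core] section of airflow.cfg:

[core]
# new DAGs start unpaused when this is False
dags_are_paused_at_creation = False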
The below worked for me.
Make sure AIRFLOW_HOME is set.
Inside AIRFLOW_HOME, have the dags and plugins folders. The folders must give the airflow user read, write, and execute permissions.
Make sure you have at least one DAG in the dags/ folder.
pip install celery[redis]==4.1.1
I checked the above solution on Airflow version 1.9.0.
I tried the same trick with Airflow 1.10 and it worked.
In Airflow, how should I handle the error "This DAG isn't available in the webserver DagBag object. It shows up in this list because the scheduler marked it as active in the metadata database"?
I've copied a new DAG to an Airflow server, and have tried:
unpausing it and refreshing it (basic operating procedure, given in this previous answer https://stackoverflow.com/a/42291683/160406)
restarting the webserver
restarting the scheduler
stopping the webserver and scheduler, resetting the database (airflow resetdb), then starting the webserver and scheduler again
running airflow backfill (suggested in "Airflow: This DAG isnt available in the webserver DagBag object")
running airflow trigger_dag
The scheduler log shows it being processed with no errors occurring, and I can interact with it and view its state through the CLI, but it still does not appear in the web UI.
Edit: the webserver and scheduler are running on the same machine with the same airflow.cfg. They're not running in Docker.
They're run by Supervisor, which runs them both as the same user (airflow). The airflow user has read, write and execute permission on all of the dag files.
This helped me...
pkill -9 -f "airflow scheduler"
pkill -9 -f "airflow webserver"
pkill -9 -f "gunicorn"
then restart the airflow scheduler and webserver.
Just had this issue myself. After changing permissions, resetting the metadata database, restarting the webserver, and even making some potential code changes to rectify the situation, nothing worked.
However, I noticed that even though we were stopping the webserver, our gunicorn processes were still running. Killing these processes and then starting everything back up resulted in success.
I had the same problem on an Airflow installation running from a Docker image.
What I did was:
1- delete all .pyc files
2- delete the DAG's rows from the metadata tables with a snippet like the one below (the import, hook, and dag_input lines are assumed additions; adjust the connection id and dag id to your environment):
from airflow.hooks.postgres_hook import PostgresHook  # assumes a Postgres metadata DB
hook = PostgresHook(postgres_conn_id="airflow_db")  # hypothetical connection id
dag_input = "your_dag_id"  # the DAG whose metadata rows should be removed
for t in ["xcom", "task_instance", "sla_miss", "log", "job", "dag_run", "dag"]:
    sql = "delete from {} where dag_id='{}'".format(t, dag_input)
    hook.run(sql, True)
3- restart webserver & scheduler
4- Execute airflow upgradedb (airflow db upgrade on Airflow 2+)
It resolved the problem for me.
If the airflow_home / dags_folder config parameters are the same for the scheduler, web UI, and command line interface, the only causes of the error
This DAG isn't available in the webserver DagBag object
can be file permissions or an error in the Python script.
Please check:
Run the DAG file as a normal Python script and check for errors (see the DagBag sketch after this list).
The user in airflow.cfg and the one creating the DAG should be the same, or the DAG should have execute permission for the airflow user.
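A quick way to run that check is to load a DagBag in a Python shell and print its import errors; the dags path below is a placeholder for your own:

# Parse the dags folder the same way the webserver does and surface any errors
# that would keep a DAG out of the DagBag.
from airflow.models import DagBag

dagbag = DagBag("/path/to/your/dags")
print(dagbag.import_errors)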
With Airflow 1.9 I don't experience the problem with zombie gunicorn processes.
I do a simple restart: systemctl restart airflow-webserver and it forces webserver to refresh DAG status.