I have configured my GKE environment with high resources (separately in node pools too) and the DAGs are set with the KubernetesPodOperator to only launch pods in those node pools.
affinity={
    'nodeAffinity': {
        'requiredDuringSchedulingIgnoredDuringExecution': {
            'nodeSelectorTerms': [{
                'matchExpressions': [{
                    'key': 'cloud.google.com/gke-nodepool',
                    'operator': 'In',
                    'values': [
                        'spawning-pool'
                    ]
                }]
            }]
        }
    }
}
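For reuse across several DAGs, the same structure could be produced by a small helper. This is only a sketch: node_pool_affinity is a hypothetical name (not an Airflow or Kubernetes API), and the dict it returns is meant to be passed as the affinity argument shown above.

```python
def node_pool_affinity(pool_names):
    """Build a nodeAffinity dict restricting pods to the given GKE node pools.

    Hypothetical helper; it only reproduces the dict structure from the post.
    """
    return {
        'nodeAffinity': {
            'requiredDuringSchedulingIgnoredDuringExecution': {
                'nodeSelectorTerms': [{
                    'matchExpressions': [{
                        'key': 'cloud.google.com/gke-nodepool',
                        'operator': 'In',
                        'values': list(pool_names),
                    }]
                }]
            }
        }
    }


# Same value as the literal dict above.
affinity = node_pool_affinity(['spawning-pool'])
```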
My airflow.cfg has also been modified to raise the concurrency for all the various (confusing) airflow config parameters:
parallelism = 100
dag_concurrency = 100
max_active_runs_per_dag = 100
However, many of my active DAGs have tasks stuck in the 'Queued' state that never start.
Do I have to restart Composer for the airflow.cfg changes to take effect, or is there something else I am missing?
EDIT:
Just a thought, but maybe this will give some ideas for resolving the bug.
I have been modifying my dag.py files whilst there are tasks running (i.e. I have CI/CD flowing into Composer's GCS dags bucket).
Could it be that the DAG is only re-parsed when no tasks are running for that DAG?
If so, the DAG code that points to the new node pool would not be parsed while that DAG's tasks are running.
Composer and Airflow versions:
In the Composer console in GCP it says:
Image version: composer-1.7.2-airflow-1.10.2
I recently upgraded Airflow from 1.10.0 to 1.10.10. In the current setup the webserver, worker, scheduler and Flower all run on the same machine, and we use the Celery executor. When a DAG runs, its first step spins up a new EMR cluster for the DAG, along with a worker node on which only the worker process runs. This worker node sends tasks to run on the EMR cluster. Once the tasks have run, the next steps terminate the EMR cluster and this worker instance.
Every task's log is written on this worker node. As long as the tasks are running, or the worker node is still up, I can see the logs in the web UI. As soon as the worker is terminated, I can no longer see them. The config is set up to upload logs to S3. I do see the logs of startEMR and startWorker on S3, since those tasks run on the main Airflow instance (where all 4 processes are running).
Here is the config snippet of airflow.cfg
base_log_folder = /home/deploy/airflow/logs
remote_logging = True
remote_base_log_folder = s3://airflow-log-bucket/airflow/logs/
remote_log_conn_id = aws_default
encrypt_s3_logs = False
s3_log_folder = '/airflow/logs/'
executor = CeleryExecutor
The same config file is set up when the worker instance is initialized for the DAG, and only the worker process is started on that node.
Here is the log from a task after the worker node has been terminated:
*** Log file does not exist: /home/deploy/airflow/logs/XXXX/XXXXXX/2020-07-07T23:30:05+00:00/1.log
*** Fetching from: http://ip-10-164-62-253.ap-southeast-2.compute.internal:8799/log/XXXX/XXXXXX/2020-07-07T23:30:05+00:00/1.log
*** Failed to fetch log file from worker. HTTPConnectionPool(host='ip-10-164-62-253.ap-southeast-2.compute.internal', port=8799): Max retries exceeded with url: /log/xxxx/XXXXXX/2020-07-07T23:30:05+00:00/1.log (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f6750ac52b0>: Failed to establish a new connection: [Errno 113] No route to host',))
So basically:
This was working in Airflow 1.10.1 (I did not need to add remote_logging=True).
The logs for the EMR start and worker node start steps are copied to S3 and are shown in the web UI.
Only the logs of tasks running on the remote worker node are not copied to S3.
Can someone please tell me what I am missing in the configuration? The same config used to work on Airflow 1.10.0.
I found my mistake. The S3 module on the new worker node was being installed via pip and not pip3, whereas the Airflow server had it installed via pip3.
Another config change I had to make was in the webserver section of the airflow.cfg file:
worker_class = sync
This was previously gevent.
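A pip/pip3 mismatch like the one described above can be spotted by asking the interpreter that actually runs the Airflow worker whether it can import the module. The sketch below is only an illustration: boto3 is my assumption for "the S3 module" (the post does not name it), and module_available and report are hypothetical helper names.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if this interpreter can find the named top-level module."""
    return importlib.util.find_spec(name) is not None


def report(modules):
    """Map each module name to whether the current interpreter can import it."""
    return {name: module_available(name) for name in modules}


if __name__ == "__main__":
    # Run this with the same python binary that starts the Airflow worker.
    print("interpreter:", sys.executable)
    for name, ok in report(["boto3", "airflow"]).items():  # boto3 is an assumption
        print(name, "importable" if ok else "MISSING for this interpreter")
```

Running it once on the main Airflow machine and once on a freshly started worker node should show whether the two environments differ.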
I upgraded Airflow from 1.7.3 to 1.10.1. After upgrading the scheduler, webserver and workers, the DAGs have stopped working, and the scheduler shows the error below:
Either the dag did not exist or it failed to parse.
I have not made any changes to the config. While investigating, I found the cause in the scheduler logs. Previously the scheduler ran the task as:
Adding to queue: airflow run <dag_id> <task_id> <execution_date> --local -sd DAGS_FOLDER/<dag_filename.py>
Now it runs the task with an absolute path:
Adding to queue: airflow run <dag_id> <task_id> <execution_date> --local -sd /<PATH_TO_DAGS_FOLDER>/<dag_filename.py>
PATH_TO_DAGS_FOLDER is like /home/<user>/Airflow/dags...
This is the same path that gets pushed to the workers, and since the worker runs as a different user, it cannot find the DAG at the specified location.
I am not sure how to tell the worker to look in its own Airflow home dir and not the scheduler's.
I am using mysql as backend and rabbitmq for message passing.
In my first foray into airflow, I am trying to run one of the example DAGS that comes with the installation. This is v.1.8.0. Here are my steps:
$ airflow trigger_dag example_bash_operator
[2017-04-19 15:32:38,391] {__init__.py:57} INFO - Using executor SequentialExecutor
[2017-04-19 15:32:38,676] {models.py:167} INFO - Filling up the DagBag from /Users/gbenison/software/kludge/airflow/dags
[2017-04-19 15:32:38,947] {cli.py:185} INFO - Created <DagRun example_bash_operator # 2017-04-19 15:32:38: manual__2017-04-19T15:32:38, externally triggered: True>
$ airflow dag_state example_bash_operator '2017-04-19 15:32:38'
[2017-04-19 15:33:12,918] {__init__.py:57} INFO - Using executor SequentialExecutor
[2017-04-19 15:33:13,229] {models.py:167} INFO - Filling up the DagBag from /Users/gbenison/software/kludge/airflow/dags
running
The dag state remains "running" for a long time (at least 20 minutes by now), although from a quick inspection of this task it should take a matter of seconds. How can I troubleshoot this? How can I see which step it is stuck on?
To run any DAGs, you need to make sure two processes are running:
airflow webserver
airflow scheduler
If you only have airflow webserver running, the UI will show DAGs as running, but if you click on the DAG, none of its tasks are actually running or scheduled; they are in a Null state.
What this means is that they are waiting to be picked up by airflow scheduler. If airflow scheduler is not running, you'll be stuck in this state forever, as the tasks are never picked up for execution.
Additionally, make sure that the toggle button in the DAGs view is switched to 'ON' for the particular DAG. Otherwise it will not get picked up by the scheduler if you trigger it manually.
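As a rough way to confirm that both processes are up, the process list can be checked for the two commands. The sketch below is an assumption-laden illustration, not an Airflow feature: missing_processes and running_command_lines are hypothetical helpers, the substring match is only a heuristic, and a POSIX ps is assumed.

```python
import subprocess

# The two commands named above.
REQUIRED = ("airflow webserver", "airflow scheduler")


def missing_processes(command_lines, required=REQUIRED):
    """Return the required commands that appear in none of the command lines."""
    return [r for r in required if not any(r in line for line in command_lines)]


def running_command_lines():
    """List the command lines of all processes (assumes a POSIX `ps`)."""
    out = subprocess.run(
        ["ps", "-A", "-o", "args="],  # "args=" prints the command line without a header
        capture_output=True, text=True, check=True,
    )
    return out.stdout.splitlines()


# Typical use on the Airflow host:
#     print(missing_processes(running_command_lines()))
```

An empty list means both commands were found; a list containing "airflow scheduler" would match the stuck state described above.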
I too recently started using Airflow and my dags kept endlessly running. Your dag may be set on 'pause' without you realizing it, and thus the scheduler will not schedule new task instances and when you trigger the dag it just looks like it is endlessly running.
There are a few solutions:
1) In the Airflow UI toggle the button left of the dag from 'Off' to 'On'. Off means that the dag is paused, so On will allow the scheduler to pick it up and complete the dag. (this fixed my initial issue)
2) In your airflow.cfg file, dags_are_paused_at_creation = True is the default, so all new DAGs you create are paused from the start. Change this to False, and future DAGs you create will be good to go right away (I had to restart the webserver and scheduler for changes to airflow.cfg to be recognized).
3) Use the command line: $ airflow unpause [dag_id]
documentation: https://airflow.apache.org/cli.html#unpause
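To see which value of dags_are_paused_at_creation a given airflow.cfg actually contains, the file can be read with Python's standard configparser. This is a minimal sketch under two assumptions: the option lives in the [core] section, and a missing option falls back to True as stated above. paused_at_creation is a hypothetical helper name.

```python
import configparser


def paused_at_creation(cfg_text):
    """Return the dags_are_paused_at_creation setting found in airflow.cfg text.

    Falls back to True (the default described above) when the option is absent.
    """
    # interpolation=None so '%' characters elsewhere in the file are left alone
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    return parser.getboolean("core", "dags_are_paused_at_creation", fallback=True)


example_cfg = "[core]\ndags_are_paused_at_creation = False\n"
print(paused_at_creation(example_cfg))
```

For a real installation, pass in the contents of the airflow.cfg under AIRFLOW_HOME instead of the example string.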
The below worked for me.
Make sure AIRFLOW_HOME is set
In AIRFLOW_HOME, have the folders dags and plugins. The folders should have r, w, x permissions for the airflow user.
Make sure you have at least one DAG in the dags/ folder.
pip install celery[redis]==4.1.1
I have checked the above solution on Airflow 1.9.0.
I tried the same trick with airflow 1.10 version and it worked.
In Airflow, how should I handle the error "This DAG isn't available in the webserver DagBag object. It shows up in this list because the scheduler marked it as active in the metadata database"?
I've copied a new DAG to an Airflow server, and have tried:
unpausing it and refreshing it (basic operating procedure, given in this previous answer https://stackoverflow.com/a/42291683/160406)
restarting the webserver
restarting the scheduler
stopping the webserver and scheduler, resetting the database (airflow resetdb), then starting the webserver and scheduler again
running airflow backfill (suggested here Airflow "This DAG isnt available in the webserver DagBag object ")
running airflow trigger_dag
The scheduler log shows it being processed and no errors occurring, I can interact with it and view its state through the CLI, but it still does not appear in the web UI.
Edit: the webserver and scheduler are running on the same machine with the same airflow.cfg. They're not running in Docker.
They're run by Supervisor, which runs them both as the same user (airflow). The airflow user has read, write and execute permission on all of the dag files.
This helped me...
pkill -9 -f "airflow scheduler"
pkill -9 -f "airflow webserver"
pkill -9 -f "gunicorn"
then restart the airflow scheduler and webserver.
Just had this issue myself. Changing permissions, resetting the meta database, restarting the webserver and even making some potential code changes to rectify the situation did not fix it.
However, I noticed that even though we were stopping the webserver, our gunicorn process was still running. Killing these processes and then starting everything back up resulted in success.
I had the same problem on an airflow installed from a Docker image
What I did was:
1- delete all .pyc files
2- delete the DAG's rows from the metadata database using:
for t in ["xcom", "task_instance", "sla_miss", "log", "job", "dag_run", "dag" ]:
    sql="delete from {} where dag_id='{}'".format(t, dag_input)
    hook.run(sql, True)
3- restart webserver & scheduler
4- Execute airflow upgradedb
It resolved the problem for me.
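The deletion loop in step 2 could also be written with a bound parameter for the dag_id instead of string formatting. The sketch below is a stand-in, not the original code: it uses an in-memory sqlite3 database with one-column tables rather than the real Airflow schema or hook, delete_dag_rows is a hypothetical name, and the "?" placeholder is sqlite3's style (other database drivers use a different one).

```python
import sqlite3

# Table names taken from the loop above; each has a dag_id column.
TABLES = ("xcom", "task_instance", "sla_miss", "log", "job", "dag_run", "dag")


def delete_dag_rows(conn, dag_id, tables=TABLES):
    """Delete all rows for dag_id from each table; return rows deleted per table.

    Table names come from the fixed tuple above; the dag_id value is bound
    as a query parameter rather than formatted into the SQL string.
    """
    deleted = {}
    for table in tables:
        cur = conn.execute("DELETE FROM {} WHERE dag_id = ?".format(table), (dag_id,))
        deleted[table] = cur.rowcount
    conn.commit()
    return deleted


# Demo against a throwaway in-memory database (not the real metadata schema).
conn = sqlite3.connect(":memory:")
for t in TABLES:
    conn.execute("CREATE TABLE {} (dag_id TEXT)".format(t))
    conn.executemany("INSERT INTO {} VALUES (?)".format(t), [("stale_dag",), ("other_dag",)])

deleted = delete_dag_rows(conn, "stale_dag")
print(deleted)
```

Rows belonging to other DAGs are left in place, which is the same effect as the original loop.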
If the airflow_home / dags_folder config parameters are the same for the scheduler, the web UI and the command line interface, the only causes of the error:
This DAG isn't available in the webserver DagBag object
can be file permissions or an error in the Python script.
Please check:
Run the DAG as a normal Python script and check for errors.
The user in airflow.cfg and the one creating the DAG should be the same, or the DAG file should have execute permission for the airflow user.
With Airflow 1.9 I don't experience the problem with zombie gunicorn processes.
I do a simple restart: systemctl restart airflow-webserver and it forces webserver to refresh DAG status.
I need some advice on how to restart all airflow services on deploy without killing the workers in the middle of a task.
I've written a deployment procedure for my DAGs which installs airflow and any other pip dependencies in a virtualenv. Once my release directory is ready, I:
Stop airflow-flower, airflow-worker, airflow-scheduler, and airflow-webserver
Update the "current" symlink to point to my new release
Start airflow-flower, airflow-worker, airflow-scheduler, and airflow-webserver
The problem with this deploy procedure is that the workers get killed immediately. I'd like to add some sort of monitoring to the script to pause all DAGs, wait for the workers to idle, then restart the services, but the airflow CLI has no way to learn which dags are enabled nor whether the workers are idle.
I understand that many of the airflow services can auto-detect changes in the dags folder, but I want each deployment to have its own virtualenv. If I don't restart all services then a new deployment won't pick up a new line in my requirements.txt file.
You have access to the Airflow DB so consider developing a deployment script that does this process for you.
Update the DAG table to pause all DAGs
Read the TASK_INSTANCE table to wait until all RUNNING state tasks complete
Restart Airflow services.
Update the DAG table to unpause DAGs.
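The four steps above could be sketched roughly as follows. Treat this as an illustration under stated assumptions: an in-memory sqlite3 database stands in for the real metadata DB, the column names is_paused and state with the value 'running' are assumed, restart_services is a placeholder callable, and all function names are hypothetical. A real backend would need its own driver and placeholder syntax.

```python
import sqlite3
import time


def pause_all_dags(conn):
    """Pause every DAG; return the ids that were unpaused beforehand."""
    active = [row[0] for row in conn.execute("SELECT dag_id FROM dag WHERE is_paused = 0")]
    conn.execute("UPDATE dag SET is_paused = 1")
    conn.commit()
    return active


def wait_until_idle(conn, poll_seconds=5.0, timeout_seconds=3600.0):
    """Poll task_instance until nothing is 'running'; False if the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        (running,) = conn.execute(
            "SELECT COUNT(*) FROM task_instance WHERE state = 'running'"
        ).fetchone()
        if running == 0:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)


def unpause_dags(conn, dag_ids):
    """Unpause only the given DAGs."""
    conn.executemany("UPDATE dag SET is_paused = 0 WHERE dag_id = ?", [(d,) for d in dag_ids])
    conn.commit()


def deploy(conn, restart_services, poll_seconds=5.0, timeout_seconds=3600.0):
    """Pause, wait for running tasks, restart, then restore the previous pause state."""
    previously_active = pause_all_dags(conn)
    try:
        if not wait_until_idle(conn, poll_seconds, timeout_seconds):
            raise TimeoutError("tasks still running; services not restarted")
        restart_services()
    finally:
        unpause_dags(conn, previously_active)


# Demo with a throwaway database: one active DAG, one already paused.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dag (dag_id TEXT PRIMARY KEY, is_paused INTEGER)")
conn.execute("CREATE TABLE task_instance (dag_id TEXT, state TEXT)")
conn.executemany("INSERT INTO dag VALUES (?, ?)", [("etl", 0), ("reports", 1)])
conn.execute("INSERT INTO task_instance VALUES ('etl', 'success')")
restarts = []
deploy(conn, restart_services=lambda: restarts.append("restarted"), poll_seconds=0.01)
```

Remembering which DAGs were active before the pause means that DAGs someone paused on purpose stay paused after the deployment.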
Airflow workers gracefully quit from a SIGINT. Update your process monitor to quit with SIGINT instead of the default. If you're using systemd, the unit file will look something like this:
...
[Service]
EnvironmentFile=/etc/sysconfig/airflow
User=airflow
Group=airflow
Type=simple
ExecStart=...
KillSignal=SIGINT
Restart=on-failure
RestartSec=10s
...
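To illustrate what quitting gracefully on SIGINT means in general, here is a toy worker loop that finishes the task in progress and then stops taking new ones. It is not how Airflow or Celery implement shutdown; GracefulWorker and its methods are hypothetical names used only for this sketch.

```python
import signal


class GracefulWorker:
    """Toy worker: a stop request lets the current task finish, then the loop ends."""

    def __init__(self):
        self.stop_requested = False
        self.completed = []

    def request_stop(self, signum=None, frame=None):
        # Signal handler: only record the request, never interrupt the task.
        self.stop_requested = True

    def install(self):
        # Route SIGINT to request_stop (call from the main thread).
        signal.signal(signal.SIGINT, self.request_stop)

    def run(self, tasks):
        for task in tasks:
            if self.stop_requested:
                break  # no new work after a stop request
            self.completed.append(task())  # a started task always runs to completion
        return self.completed


worker = GracefulWorker()


def second_task():
    # Simulate SIGINT arriving while this task is running.
    worker.request_stop(signal.SIGINT, None)
    return "second"


results = worker.run([lambda: "first", second_task, lambda: "third"])
print(results)
```

In the demo the stop request arrives during the second task, so that task still completes and the third one is never started.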