We run Airflow on AWS ECS and bundle all DAGs into a Docker image. From time to time we update the DAGs and deploy a new version of the image. When we do so, ECS kills the running Airflow webserver / scheduler / workers and starts new ones. We ran into an issue where some DAG runs that started before the deployment were still marked as "running" after it, even though their tasks had been killed by the deployment, and some external sensor tasks were left hanging.
Is there a way to gracefully shut down Airflow? Ideally, it would mark all running tasks as up_for_retry and rerun them once the deployment is finished.
docker stop <container-id/name>
This performs a warm shutdown of the Celery workers:
worker_1 | worker: Warm shutdown (MainProcess)
worker_1 |
worker_1 | -------------- celery@155afae87458 v4.1.1 (latentcall)
worker_1 | ---- **** -----
worker_1 | --- * *** * -- Linux-4.9.93-linuxkit-aufs-x86_64-with-debian-9.5 2018-10-18 16:52:08
worker_1 | -- * - **** ---
worker_1 | - ** ---------- [config]
.......
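For context: docker stop sends SIGTERM and, after a grace period (10 seconds by default), follows up with SIGKILL. If your tasks need longer to drain, you can extend the grace period, e.g.:
docker stop --time 300 <container-id/name>
On ECS, the corresponding setting is the container-level stopTimeout in the task definition, which gives the warm shutdown time to finish before the container is killed.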
I have created a production Docker image using the provided Breeze command-line tool. However, when I run the airflow worker command, I get the message shown below on the command line.
Breeze command:
./breeze build-image --production-image --python 3.7 --additional-extras=jdbc --additional-python-deps="pandas pymysql" --additional-runtime-apt-deps="default-jre-headless"
Can anyone help with how to move the worker off the development server?
airflow-worker_1 | Starting flask
airflow-worker_1 | * Serving Flask app "airflow.utils.serve_logs" (lazy loading)
airflow-worker_1 | * Environment: production
airflow-worker_1 | WARNING: This is a development server. Do not use it in a production deployment.
airflow-worker_1 | Use a production WSGI server instead.
airflow-worker_1 | * Debug mode: off
airflow-worker_1 | [2021-02-08 21:57:58,409] {_internal.py:113} INFO - * Running on http://0.0.0.0:8793/ (Press CTRL+C to quit)
here is a discussion by airflow maintainer in github: https://github.com/apache/airflow/discussions/18519
It's harmless. It's an internal server run by the executor to share logs with the webserver. It has already been corrected in main to use a 'production' setup (though it's not really needed in this case, as the log "traffic" and its characteristics are not like a production webserver's).
The fix will be released in Airflow 2.2 (about a month from now).
Airflow stopped running tasks all of a sudden. The following are all running:
airflow scheduler
airflow webserver
airflow worker
Web UI message:
All dependencies are met but the task instance is not running. In most
cases this just means that the task will probably be scheduled soon
unless:
- The scheduler is down or under heavy load
If this task instance does not start soon please contact your Airflow
administrator for assistance.
The scheduler seems to be stuck in a loop, repeating the messages below. The web UI shows tasks in the queued state. I tried restarting the scheduler, but it didn't help.
[2018-11-17 22:03:45,809] {{jobs.py:1607}} DEBUG - Starting Loop...
[2018-11-17 22:03:45,809] {{jobs.py:1627}} INFO - Heartbeating the process manager
[2018-11-17 22:03:45,810] {{jobs.py:1662}} INFO - Heartbeating the executor
[2018-11-17 22:03:45,810] {{base_executor.py:103}} DEBUG - 124 running task instances
[2018-11-17 22:03:45,810] {{base_executor.py:104}} DEBUG - 0 in queue
[2018-11-17 22:03:45,810] {{base_executor.py:105}} DEBUG - 76 open slots
[2018-11-17 22:03:45,810] {{base_executor.py:132}} DEBUG - Calling the <class 'airflow.executors.celery_executor.CeleryExecutor'> sync method
[2018-11-17 22:03:45,810] {{celery_executor.py:80}} DEBUG - Inquiring about 124 celery task(s)
Airflow setup:
apache-airflow[celery, redis, all]==1.9.0
I also checked these posts, but they didn't help me:
Airflow 1.9.0 is queuing but not launching tasks
Airflow tasks get stuck at "queued" status and never gets running
Problem solved. This is a problem with builds created on or after 2018-11-15. It turns out that apache-airflow[celery, redis, all]==1.9.0 pulls in the latest redis-py, 3.0.1, which does not work with celery 4.2.1.
The solution is to pin redis-py to 2.10.6:
redis==2.10.6
apache-airflow[celery, all]==1.9.0
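With a pip-based install, applying the pin looks like:
pip install 'redis==2.10.6' 'apache-airflow[celery,all]==1.9.0'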
I have set up Airflow to run in distributed mode with 10 worker nodes. I tried to assess the performance of parallel workloads by triggering a test DAG that contains just one task, which sleeps for 3 seconds and then exits.
I triggered the DAG using the command:
airflow backfill test_dag -s 2015-06-20 -e 2015-07-10
The scheduler kicks off the jobs/DAGs in parallel, and I frequently see the output below:
[2017-06-27 09:52:29,611] {models.py:4024} INFO - Updating state for considering 1 task(s)
[2017-06-27 09:52:29,647] {models.py:4024} INFO - Updating state for considering 1 task(s)
[2017-06-27 09:52:29,664] {jobs.py:1983} INFO - [backfill progress] | finished run 19 of 21 | tasks waiting: 0 | succeeded: 19 | kicked_off: 2 | failed: 0 | skipped: 0 | deadlocked: 0 | not ready: 0
Here kicked_off: 2 indicates that two tasks were kicked off, but when I check the UI for the status of the DAG runs, I see two DAG instances still running. Their task instance logs indicate that the tasks completed successfully, yet the message above keeps being printed on the command prompt indefinitely.
Is it that the messages being sent by the worker are getting dropped, and hence the status is not being updated?
Is there any parameter in the airflow.cfg file that allows failed tasks like these to be retried on other worker nodes, instead of waiting indefinitely for a message from the worker node responsible for executing the failed tasks?
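For what it's worth, there is no single airflow.cfg switch for this, but task-level retries come close: a task that fails is re-queued, and with the CeleryExecutor any available worker may pick up the retry. A minimal sketch of such a DAG (illustrative names, Airflow 1.x-era API):

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    'owner': 'airflow',
    'retries': 3,                         # re-attempt a failed task up to 3 times
    'retry_delay': timedelta(minutes=5),  # wait between attempts
}

dag = DAG(
    dag_id='test_dag',
    default_args=default_args,
    start_date=datetime(2015, 6, 20),
    schedule_interval='@daily',
)

# The single 3-second sleep task described in the question
sleep_task = BashOperator(
    task_id='sleep_3_seconds',
    bash_command='sleep 3',
    dag=dag,
)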
All,
I have an issue: how do I run a BOSH errand on existing VMs?
I have a VM with MongoDB deployed on it, and I want to run an errand command on that VM, but I don't know how.
Does anyone know?
Director task 2121
Task 2121 done
+---------+---------+---------+--------------+
| VM | State | VM Type | IPs |
+---------+---------+---------+--------------+
| app-p/0 | stopped | app | 10.62.90.171 |
+---------+---------+---------+--------------+
How to run an errand:
bosh run errand ERRAND_NAME [--download-logs] [--logs-dir DESTINATION_DIRECTORY]
ref: http://bosh.io/docs/sysadmin-commands.html#errand
This assumes you have already marked the deployment job's lifecycle as errand in your manifest, as sketched below.
ref: Jobs Block section of http://bosh.io/docs/deployment-manifest.html
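For illustration, a jobs block with an errand lifecycle might look like this (the job name smoke-tests is hypothetical):
jobs:
- name: smoke-tests    # hypothetical errand job name
  lifecycle: errand
  instances: 1
  ...
You would then run it with: bosh run errand smoke-tests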
As Ben Moss mentioned in his comment, a new VM is created for running an errand job. The errand VM does not stay around after the errand completes. An errand is a short-lived job and can be run multiple times.
I have an OBIEE 11g installation on a Red Hat machine, but I'm having problems getting it running. I can start WebLogic and its services, so I'm able to access the WebLogic console and Enterprise Manager, but problems arise when I try to start the OBIEE components with the opmnctl command.
The steps I'm performing are the following:
1) Start WebLogic
cd /home/Oracle/Middleware/user_projects/domains/bifoundation_domain/bin/
./startWebLogic.sh
2) Start NodeManager
cd /home/Oracle/Middleware/wlserver_10.3/server/bin/
./startNodeManager.sh
3) Start Managed WebLogic
cd /home/Oracle/Middleware/user_projects/domains/bifoundation_domain/bin/
./startManagedWebLogic.sh bi_server1
4) Set up OBIEE Components
cd /home/Oracle/Middleware/instances/instance1/bin/
./opmnctl startall
The result is:
opmnctl startall: starting opmn and all managed processes...
================================================================================
opmn id=JustiziaInf.mmmmm.mmmmm.9999
Response: 4 of 5 processes started.
ias-instance id=instance1
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
ias-component/process-type/process-set:
coreapplication_obisch1/OracleBISchedulerComponent/coreapplication_obisch1/
Error
--> Process (index=1,uid=1064189424,pid=4396)
failed to start a managed process after the maximum retry limit
Log:
/home/Oracle/Middleware/instances/instance1/diagnostics/logs/OracleBISchedulerComponent/
coreapplication_obisch1/console~coreapplication_obisch1~1.log
5) Check the status of components
cd /home/Oracle/Middleware/instances/instance1/bin/
./opmnctl status
Processes in Instance: instance1
---------------------------------+--------------------+---------+---------
ias-component | process-type | pid | status
---------------------------------+--------------------+---------+---------
coreapplication_obiccs1 | OracleBIClusterCo~ | 8221 | Alive
coreapplication_obisch1 | OracleBIScheduler~ | N/A | Down
coreapplication_obijh1 | OracleBIJavaHostC~ | 8726 | Alive
coreapplication_obips1 | OracleBIPresentat~ | 6921 | Alive
coreapplication_obis1 | OracleBIServerCom~ | 7348 | Alive
Read the log file at /home/Oracle/Middleware/instances/instance1/diagnostics/logs/OracleBISchedulerComponent/coreapplication_obisch1/console~coreapplication_obisch1~1.log.
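For example, to see the most recent entries (assuming the same instance layout as above):
tail -n 200 /home/Oracle/Middleware/instances/instance1/diagnostics/logs/OracleBISchedulerComponent/coreapplication_obisch1/console~coreapplication_obisch1~1.log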
I would recommend trying the steps in the link below, as this is a common issue when upgrading OBIEE.
http://www.askjohnobiee.com/2012/11/fyi-opmnctl-failed-to-start-managed.html
I'm not sure what your log says, but try the steps below and check whether they work:
Log in as the superuser
cd $ORACLE_HOME/Apache/Apache/bin
chmod 6750 .apachectl
Log out and log back in as the ORACLE user
opmnctl startproc process-type=OracleBIScheduler
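Then re-run ./opmnctl status from the instance bin directory (as in step 5 above) to confirm that coreapplication_obisch1 shows as Alive.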