I am new to Airflow and have followed a few bad practices, one of which is manually marking a task as failed while it is still running. Airflow then leaves the underlying job running in the background as a ghost process, and I cannot find a way to kill it. I have seen the on_kill() option, but I am not really sure how to implement it.
More specifically, a task called get_proxies starts some servers via SSH; these servers produce a list of proxies to be used in the rest of the pipeline, and the task usually takes a few minutes to finish before the pipeline continues. At the end of the DAG there is a task called destroy_proxies (also via SSH) which, as its name suggests, destroys the servers where the proxies are running and then stops the containers (docker-compose down). Its trigger rule is all_done, so even if the pipeline fails, the proxies are destroyed.
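For reference, a minimal sketch of how such a DAG might be wired (the DAG id, connection id, and commands are placeholders; only the two tasks described above are shown):

    from pendulum import datetime

    from airflow import DAG
    from airflow.providers.ssh.operators.ssh import SSHOperator

    with DAG(
        dag_id="proxy_pipeline",           # hypothetical DAG id
        start_date=datetime(2022, 1, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        get_proxies = SSHOperator(
            task_id="get_proxies",
            ssh_conn_id="proxy_servers",   # placeholder connection
            command="./start_proxies.sh",  # placeholder command
        )

        destroy_proxies = SSHOperator(
            task_id="destroy_proxies",
            ssh_conn_id="proxy_servers",
            command="./destroy_proxies.sh && docker-compose down",
            trigger_rule="all_done",       # runs even when upstream tasks fail
        )

        # ...the rest of the pipeline sits between these two tasks...
        get_proxies >> destroy_proxies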
Last time, while I was doing some tests and the get_proxies task was running, I decided to mark it manually as Failed. A few seconds later the destroy_proxies task ran successfully and destroyed the proxies that had been created so far. However, I noticed afterwards that the proxies were still running, so I need a way to handle these crashes or manually set failed/success states, because they leave running jobs behind in the background that I have no way to stop or kill (I already have a workaround to destroy such proxies, but still).
More specifically, I have two questions: 1) How can I kill those running jobs? 2) How can I handle tasks being killed externally so that they do not leave behind ghost jobs that are hard to access?
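Regarding on_kill(): as far as I understand, it is a method you override in a custom operator, and Airflow calls it when the running task instance is externally terminated (for example when it is marked as failed from the UI). A rough, untested sketch of how it might look for the case above, wrapping SSHOperator (the class name and cleanup command are placeholders):

    from airflow.providers.ssh.hooks.ssh import SSHHook
    from airflow.providers.ssh.operators.ssh import SSHOperator


    class ProxySSHOperator(SSHOperator):
        """Hypothetical SSHOperator subclass that cleans up the remote
        servers when Airflow kills the task."""

        def on_kill(self):
            # Called by Airflow when the running task instance is externally
            # terminated; open a fresh SSH connection and tear down whatever
            # get_proxies may have started so far.
            self.log.info("Task was killed externally, destroying proxies...")
            hook = SSHHook(ssh_conn_id=self.ssh_conn_id)
            client = hook.get_conn()
            try:
                _, stdout, _ = client.exec_command("./destroy_proxies.sh")  # placeholder
                stdout.channel.recv_exit_status()  # wait for the cleanup to finish
            finally:
                client.close()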
Some information about my Airflow version:
Apache Airflow
version | 2.3.2
executor | CeleryExecutor
task_logging_handler | airflow.utils.log.file_task_handler.FileTaskHandler
dags_folder | /opt/airflow/airflow-app/dags
plugins_folder | /opt/airflow/plugins
base_log_folder | /opt/airflow/logs
remote_base_log_folder |
System info
OS | Linux
architecture | x86_64
locale | ('en_US', 'UTF-8')
python_version | 3.9.13 (main, May 28 2022, 14:03:04) [GCC 10.2.1 20210110]
python_location | /usr/local/bin/python
For the last task, destroy_proxies, which is an SSHOperator, I was thinking of adding some wait time to the command being executed, but that is not a great solution, since the get_proxies task does not always take the same amount of time.
On the other hand (though maybe it is worth asking a separate question about this), sometimes when I trigger the DAG manually, I get some failed tasks with no logs. I suspect this is related, since the extra jobs running in the background might be causing memory issues...
It is my first time writing a question here, so I hope I am being clear; in any case, if more information is needed or anything I wrote is unclear, I am happy to rewrite it and provide more details. Thank you all!
Related
We started to implement Airflow for task scheduling about a year ago, and we are slowly migrating more and more tasks to it. At some point we noticed that the server was filling up with logs, even after we implemented remote logging to S3. I'm trying to understand the best way to handle logs, and I've found a lot of conflicting advice, such as in this Stack Overflow question from four years ago. The suggested approaches include:
Implementing maintenance DAGs to clean out logs (airflow-maintenance-dags)
Implementing our own FileTaskHandler (a rough sketch of what we have in mind is included below)
Using the logrotate Linux utility
When we implemented remote logging, we expected the local logs to be removed after they were shipped to S3, but this is not the case: local logs remain on the server. I thought this might be a problem with our configuration, but I haven't found any way to fix it. Also, remote logging only applies to task logs, while process logs (specifically scheduler logs) are always local, and they take up the most space.
We tried to implement maintenance DAGs, but our workers run in a different location from the rest of Airflow, particularly the scheduler, so only task logs were getting cleaned. We could work around this by creating a new worker that shares logs with the scheduler, but we would prefer not to create extra workers.
We haven't tried either of the other two suggestions yet. That is why I want to understand: how are other people solving this, and what is the recommended way?
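For the custom FileTaskHandler option, what we had in mind is roughly the following untested sketch, assuming the S3 remote logging described above: subclass the remote handler so the local file is removed once the upload in close() has happened. The class name is made up, and the handler would still have to be wired in through a custom logging config; it also only helps with task logs, so scheduler logs would still need something like logrotate or a cron job.

    import os

    from airflow.providers.amazon.aws.log.s3_task_handler import S3TaskHandler


    class CleanupS3TaskHandler(S3TaskHandler):
        """Hypothetical handler: delete the local task log after
        S3TaskHandler.close() has shipped it to S3."""

        def close(self):
            # Remember where the local copy lives before the handler closes.
            local_path = getattr(self.handler, "baseFilename", None)
            super().close()  # uploads the log to S3 first
            if local_path and os.path.exists(local_path):
                os.remove(local_path)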
Possibly a dumb question, but when I read the Airflow architecture for 2.4.2, there was no mention of the Triggerer. The only components were the Metastore, the Webserver, and the Scheduler (the executor is part of the scheduler).
Having said that, do we still need a Triggerer pod in Airflow 2.4.2 if the deployment is on EKS and the executor is the KubernetesExecutor?
What does the Triggerer pod do here?
Thanks
It's needed for deferrable operators, which are a fairly advanced feature if you're just getting started. A deferrable operator frees up its worker slot when it makes a call that takes a while to complete, and marks the task as waiting for a trigger. The triggerer detects when that trigger completes, and the scheduler then picks the task back up and assigns it to a worker to continue.
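To make that concrete, a minimal illustrative sketch of a deferrable operator (the operator and its fixed wait are made up; the point is that execute() hands the waiting over to the triggerer instead of occupying a worker slot):

    from datetime import timedelta

    from airflow.models.baseoperator import BaseOperator
    from airflow.triggers.temporal import TimeDeltaTrigger


    class WaitForSomethingOperator(BaseOperator):
        """Hypothetical deferrable operator: frees its worker slot while waiting."""

        def execute(self, context):
            # Defer: the worker slot is released and the triggerer takes over.
            self.defer(
                trigger=TimeDeltaTrigger(timedelta(minutes=10)),
                method_name="execute_complete",
            )

        def execute_complete(self, context, event=None):
            # The scheduler re-queues the task here once the trigger fires.
            self.log.info("Trigger fired; resuming on a worker.")

So the Triggerer pod is only required if you (or a provider package you rely on) actually use deferrable operators or sensors; a KubernetesExecutor deployment that doesn't use any of them can run without it.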
Is it possible to spawn a new worker process and gracefully shut down an existing one dynamically using Lua scripting in OpenResty?
Yes but No
OpenResty itself doesn't really offer this kind of functionality directly, but it does give you the necessary building blocks:
nginx workers can be terminated by sending a signal to them
OpenResty lets you read the PID of the current worker process
LuaJIT's FFI allows you to use the kill() system call, or
using os.execute you can call the kill command directly.
Combining those, you should be able to achieve what you want :D
Note: After reading the question again, I noticed that I really only answered the second part.
nginx uses a set number of worker processes, so you can only shut down running workers, which the master process will then restart; the number of workers stays the same.
If you just want to change the number of worker processes, you would have to restart the nginx instance completely (I just tried nginx -s reload -g 'worker_processes 4;' and it didn't actually spawn any additional workers).
However, I can't see a good reason why you'd ever do that. If you need additional threads, there's a separate API for that; otherwise, you'll probably just have to live with a hard restart.
Consider a Linux cluster of N nodes. It needs to run M tasks, and each task can run on any node. Assume the cluster is up and working normally.
Question: what's the simplest way to monitor that the M tasks are running, and, if a task exits abnormally (exit code != 0), to start a new task on any of the machines that are up? Ignore network partitions.
Two of the M tasks have a dependency: if task 'm' goes down, task 'm1' should be stopped. Then 'm' is restarted, and once it is up, 'm1' can be restarted ('m1' depends on 'm'). I can provide an orchestration script for this.
I eventually want to work up to Kubernetes which does self-healing but I'm not there yet.
The right (tm) way to do this is to set up a retry, potentially with some back-off strategy. There are many similar questions here on Stack Overflow about how to do this; this is one of them.
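Assuming the tasks are Celery tasks (which is the context of this answer), a minimal retry-with-back-off sketch (the broker URL and the task body are placeholders):

    import random

    from celery import Celery

    app = Celery("tasks", broker="amqp://guest@localhost//")  # placeholder broker


    @app.task(
        autoretry_for=(Exception,),   # retry on any exception; narrow this in practice
        max_retries=5,
        retry_backoff=True,           # exponential back-off between attempts
        retry_backoff_max=600,        # cap the back-off at 10 minutes
        retry_jitter=True,
    )
    def process(item):
        # Placeholder body that fails randomly to demonstrate the retries.
        if random.random() < 0.5:
            raise RuntimeError("transient failure")
        return item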
If you still want to do the monitoring and explicit task restarts, you can implement a service based on the task events that will do it for you. It is extremely simple, and proof of how brilliant Celery is. The service should handle the task-failed event; an example of how to do this is on the same page.
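A rough sketch of such a service, following the real-time event processing pattern from the Celery documentation (the broker URL and the restart logic are placeholders):

    from celery import Celery

    app = Celery(broker="amqp://guest@localhost//")  # placeholder broker URL


    def monitor(app):
        state = app.events.State()

        def on_task_failed(event):
            state.event(event)
            task = state.tasks.get(event["uuid"])
            print(f"Task {task.name}[{task.uuid}] failed: {task.info()}")
            # Here you would re-submit the failed task, and stop/restart the
            # dependent task ('m1' in the question) as needed.

        with app.connection() as connection:
            recv = app.events.Receiver(
                connection,
                handlers={"task-failed": on_task_failed, "*": state.event},
            )
            recv.capture(limit=None, timeout=None, wakeup=True)


    if __name__ == "__main__":
        monitor(app)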
If you just need an initialization task to run for each computation task, you can use the Job concept along with an init container. Jobs are tasks that run just once until completion; Kubernetes will restart them if they crash.
Init containers run before the actual pod containers are started and are used for initialization tasks: https://kubernetes.io/docs/concepts/workloads/pods/init-containers/
I have a job running using Hadoop 0.20 on 32 spot instances. It has been running for 9 hours with no errors and has processed 3800 tasks during that time, but I have noticed that just two tasks appear to be stuck and have been running alone for a couple of hours (apparently still responsive, since they don't time out). The tasks don't typically take more than 15 minutes. I don't want to lose all the work that's already been done, because it costs me a lot of money. I would really just like to kill those two tasks and have Hadoop either reassign them or just count them as failed. Until they stop, I cannot get the reduce results from the other 3798 maps!
But I can't figure out how to do that. I have considered trying to figure out which instances are running the tasks and then terminating those instances, but
I don't know how to figure out which instances are the culprits
I am afraid it will have unintended effects.
How do I just kill individual map tasks?
Generally, on a Hadoop cluster you can kill a particular task attempt by issuing:
hadoop job -kill-task [attempt_id]
This will kill the given map task and re-submit it on a different node with a new attempt id.
To get the attempt_id, navigate in the JobTracker's web UI to the map task in question, click on it and note its id (e.g. attempt_201210111830_0012_m_000000_0).
ssh to the master node as mentioned by Lorand, and execute:
bin/hadoop job -list
bin/hadoop job -kill <JobID>