I have a job running using Hadoop 0.20 on 32 spot instances. It has been running for 9 hours with no errors. It has processed 3800 tasks during that time, but I have noticed that just two tasks appear to be stuck and have been running alone for a couple of hours (apparently responding because they don't time out). The tasks don't typically take more than 15 minutes. I don't want to lose all the work that's already been done, because it costs me a lot of money. I would really just like to kill those two tasks and have Hadoop either reassign them or just count them as failed. Until they stop, I cannot get the reduce results from the other 3798 maps!
But I can't figure out how to do that. I have considered trying to figure out which instances are running the tasks and then terminating those instances, but:
- I don't know how to figure out which instances are the culprits
- I am afraid it will have unintended effects
How do I just kill individual map tasks?
Generally, on a Hadoop cluster you can kill a particular task by issuing:
hadoop job -kill-task [attempt_id]
This will kill the given map task and re-submit it on a different
node with a new attempt id.
To get the attempt_id, navigate in the JobTracker's web UI to the map task
in question, click on it, and note its id (e.g.: attempt_201210111830_0012_m_000000_0)
ssh to the master node as mentioned by Lorand, and execute:
bin/hadoop job -list
bin/hadoop job -kill <JobID>
Related
I am new to Airflow and I have picked up a few bad practices, including manually marking a task as failed while it was running; Airflow then leaves the underlying job running as a ghost process, and I cannot find a way to kill it. I have seen the on_kill() hook, but I am not really sure how to implement it.
More specifically, I start some servers via SSH in a task called get_proxies (it starts servers that produce a list of proxies to be used in the pipeline); it usually takes a few minutes to finish, and then the pipeline continues. At the end I have a task called destroy_proxies (also via SSH) which, as its name suggests, destroys the servers where the proxies are running and also stops the containers (docker-compose down -d). Its trigger rule is all_done, so even if the pipeline fails, it destroys the proxies.
Last time, while I was doing some tests and the get_proxies task was running, I decided to manually mark it as Failed. The destroy_proxies task ran successfully a few seconds later and, of course, destroyed the proxies that had been created so far. However, I noticed afterwards that the proxy servers were still running: the manually failed task left its job running in the background, and there is no way for me to stop or kill it (I already have a workaround to destroy such proxies, but still).
More specifically, I have two questions: 1) How do I kill those running jobs? 2) How do I handle tasks being killed externally so that they do not leave behind ghost jobs that are hard to access?
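For context, this is roughly how I understand on_kill() is supposed to be used (a self-contained sketch: a plain class stands in for Airflow's BaseOperator here, and the start/teardown logic is hypothetical placeholder code):

```python
class ProxyServerOperator:
    """Sketch of an operator with on_kill() cleanup. In real Airflow you
    would subclass BaseOperator; on_kill() is the hook Airflow calls when
    the task instance is killed externally (e.g. marked failed in the UI)."""

    def __init__(self):
        self.server_ids = []

    def execute(self, context=None):
        # hypothetical: start the proxy servers and remember their ids,
        # so on_kill() has enough state to tear them down later
        self.server_ids = ["proxy-1", "proxy-2"]
        return self.server_ids

    def on_kill(self):
        # called on external kill: destroy everything execute() started
        destroyed = list(self.server_ids)
        self.server_ids = []
        return destroyed
```

The key point, as I understand it, is that execute() must keep a handle on whatever it creates, because on_kill() runs against the same operator instance.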
Some information about my Airflow version:
Apache Airflow
version | 2.3.2
executor | CeleryExecutor
task_logging_handler | airflow.utils.log.file_task_handler.FileTaskHandler
dags_folder | /opt/airflow/airflow-app/dags
plugins_folder | /opt/airflow/plugins
base_log_folder | /opt/airflow/logs
remote_base_log_folder |
System info
OS | Linux
architecture | x86_64
locale | ('en_US', 'UTF-8')
python_version | 3.9.13 (main, May 28 2022, 14:03:04) [GCC 10.2.1 20210110]
python_location | /usr/local/bin/python
For the last task, destroy_proxies, which is an SSHOperator, I was thinking of adding some wait time to the command being executed, but that is not a great solution, since the get_proxies task does not always take the same amount of time.
On the other hand (though maybe it is worth asking a separate question about this), sometimes when I trigger the DAG manually I get some failed tasks with no logs, and I suspect it is related: with extra jobs running in the background, the workers might be facing memory issues...
It is my first time writing a question here, so I hope I am being clear, but in any case if more information is needed, or what I wrote is not entirely clear, I am always willing to re-write it and provide more info. Thank you all!
I have a DAG with 2 tasks:
download_file_from_ftp >> transform_file
My concern is that the tasks can be performed on different workers. The file would be downloaded on the first worker and transformed on another worker, where an error would occur because the file is missing. Is it possible to configure the DAG so that all tasks are performed on one worker?
It's a bad practice. Even if you find a workaround, it will be very unreliable.
In general, if your executor allows it, you can configure tasks to execute on a specific worker type. For example, with the CeleryExecutor you can assign tasks to a specific queue. Assuming there is only one worker consuming from that queue, your tasks will be executed on the same worker, BUT the fact that it's one worker doesn't mean it will be the same machine. That depends heavily on the infrastructure you use. For example: when you restart your machines, do you get the exact same machine back, or is a new one spawned?
I highly advise you - don't go down this road.
To solve your issue, either download the file to shared storage like S3, Google Cloud Storage, etc., so that all workers can read the file from the cloud, or combine the download and transform into a single operator so that both actions are executed together.
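The second option can be sketched as a single callable that does both steps on one worker (plain Python; the download and transform callables are stand-ins for your actual FTP and transform logic):

```python
import os
import tempfile


def download_and_transform(download, transform):
    """Run both steps in one task so they share one worker's filesystem.

    `download(path)` writes the file to the given path;
    `transform(path)` reads it back and returns the result.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "input.dat")
        download(path)          # e.g. fetch from FTP
        return transform(path)  # e.g. parse / convert
```

In Airflow you would wrap this in a single PythonOperator (or a custom operator), so both steps are guaranteed to run in the same process on the same worker.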
Consider a linux cluster of N nodes. It needs to run M tasks. Each task can run on any node. Assume the cluster is up and working normally.
Question: what's the simplest way to monitor that the M tasks are running, and if a task exits abnormally (exit code != 0), start a new task on any of the machines that are up? Ignore network partitions.
Two of the M tasks have a dependency: if task 'm' goes down, task 'm1' should be stopped. Then 'm' is started, and once it is up, 'm1' can be restarted. 'm1' depends on 'm'. I can provide an orchestration script for this.
I eventually want to work up to Kubernetes which does self-healing but I'm not there yet.
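The restart logic I have in mind would look roughly like this (a plain-Python sketch; the task names and the dependency map are illustrative):

```python
def plan_restarts(exit_codes, deps):
    """Decide which tasks to (re)start.

    exit_codes: task name -> exit code (None = still running, 0 = ok)
    deps: dependent task -> the task it depends on (e.g. {"m1": "m"})
    Returns the tasks to restart, dependencies first.
    """
    failed = {t for t, code in exit_codes.items() if code not in (None, 0)}
    to_restart = set(failed)
    # if a dependency failed, its dependent must be stopped and restarted too
    for dependent, dependency in deps.items():
        if dependency in failed:
            to_restart.add(dependent)
    # start dependencies before the tasks that depend on them
    return sorted(to_restart, key=lambda t: (t not in deps.values(), t))
```

A monitoring loop would collect exit codes (e.g. from wait() on the processes), call this, and launch the returned tasks in order on any up node.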
The right (tm) way to do this is to set up a retry, potentially with some back-off strategy. There are many similar questions here on StackOverflow about how to do this; this is one of them.
If you still want to do the monitoring and explicit task restart, you can implement a service based on the task events that will do it for you. It is extremely simple, and proof of how brilliant Celery is. The service should handle the task-failed event. An example of how to do it is on the same page.
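If you roll your own, the back-off part can be sketched framework-agnostically (plain Python; inside Celery itself, task retry options such as max_retries and the retry back-off settings do this for you):

```python
import random
import time


def retry_with_backoff(fn, max_retries=5, base=1.0, cap=60.0, sleep=time.sleep):
    """Call fn(); on failure, retry up to max_retries times with
    exponential back-off plus jitter. `sleep` is injectable for testing."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: propagate the last error
            delay = min(cap, base * (2 ** attempt))
            sleep(delay + random.uniform(0, delay / 2))
```

The jitter avoids many failed tasks retrying in lock-step against the same resource.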
If you just need an initialization task to run for each computation task, you can use the Job concept along with an init container. Jobs are tasks that run just once until completion; Kubernetes will restart them if they crash.
Init containers run before the actual pod containers are started and are used for initialization tasks: https://kubernetes.io/docs/concepts/workloads/pods/init-containers/
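A minimal Job manifest with an init container might look like this (a sketch; the name, images, and commands are placeholders):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: compute-task          # placeholder name
spec:
  backoffLimit: 4             # how many times Kubernetes retries on failure
  template:
    spec:
      restartPolicy: OnFailure
      initContainers:
      - name: init            # runs to completion before the main container
        image: busybox
        command: ["sh", "-c", "echo preparing"]
      containers:
      - name: main
        image: busybox
        command: ["sh", "-c", "echo computing"]
```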
The LocalExecutor spawns new processes while scheduling tasks. Is there a limit to the number of processes it creates? I needed to change it, and I need to know the difference between the scheduler's "max_threads" and
"parallelism" in airflow.cfg.
parallelism: not a very descriptive name. The description says it sets the maximum task instances for the airflow installation, which is a bit ambiguous — if I have two hosts running airflow workers, I'd have airflow installed on two hosts, so that should be two installations, but based on context 'per installation' here means 'per Airflow state database'. I'd name this max_active_tasks.
dag_concurrency: Despite the name, based on the comment this is actually the task concurrency, and it's per worker. I'd name this max_active_tasks_for_worker (per_worker would suggest that it's a global setting for workers, but I think you can have workers with different values set for this).
max_active_runs_per_dag: This one's kinda alright, but since it seems to be just a default value for the matching DAG kwarg, it might be nice to reflect that in the name, something like default_max_active_runs_for_dags.
So let's move on to the DAG kwargs:
concurrency: Again, having a general name like this, coupled with the fact that concurrency is used for something different elsewhere makes this pretty confusing. I'd call this max_active_tasks.
max_active_runs: This one sounds alright to me.
source: https://issues.apache.org/jira/browse/AIRFLOW-57
max_threads gives the user some control over cpu usage. It specifies scheduler parallelism.
It's 2019 and more updated docs have come out. In short:
AIRFLOW__CORE__PARALLELISM is the max number of task instances that can run concurrently across ALL of Airflow (all tasks across all dags)
AIRFLOW__CORE__DAG_CONCURRENCY is the max number of task instances allowed to run concurrently FOR A SINGLE SPECIFIC DAG
These docs describe it in more detail:
According to https://www.astronomer.io/guides/airflow-scaling-workers/:
parallelism is the max number of task instances that can run
concurrently on airflow. This means that across all running DAGs, no
more than 32 tasks will run at one time.
And
dag_concurrency is the number of task instances allowed to run
concurrently within a specific dag. In other words, you could have 2
DAGs running 16 tasks each in parallel, but a single DAG with 50 tasks
would also only run 16 tasks - not 32
And, according to https://airflow.apache.org/faq.html#how-to-reduce-airflow-dag-scheduling-latency-in-production:
max_threads: Scheduler will spawn multiple threads in parallel to
schedule dags. This is controlled by max_threads with default value of
2. User should increase this value to a larger value(e.g numbers of cpus where scheduler runs - 1) in production.
But it seems like this last piece shouldn't take up too much time, because it's just the "scheduling" portion, not the actual running portion. Therefore we didn't see the need to tweak max_threads much, but AIRFLOW__CORE__PARALLELISM and AIRFLOW__CORE__DAG_CONCURRENCY did affect us.
The scheduler's max_threads is the number of processes to parallelize the scheduler over. The max_threads cannot exceed the cpu count. The LocalExecutor's parallelism is the number of concurrent tasks the LocalExecutor should run. Both the scheduler and the LocalExecutor use python's multiprocessing library for parallelism.
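Putting the knobs discussed above side by side in airflow.cfg (the values shown are the ones quoted above, for illustration only):

```ini
[core]
# max task instances running concurrently across the whole installation
parallelism = 32
# max task instances running concurrently within a single DAG
dag_concurrency = 16

[scheduler]
# processes the scheduler parallelizes over (scheduling only, not running)
max_threads = 2
```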
I'm trying to get Sun Grid Engine (SGE) to run the separate processes of an MPI job over all of the nodes of my cluster.
What is happening is that each node has 12 processors, so SGE is assigning 12 of my 60 processes to each of 5 separate nodes.
I'd like it to assign 2 processes to each of the 30 available nodes, because with 12 processes (DNA sequence alignments) running on each node, the nodes run out of memory.
So I'm wondering if it's possible to explicitly get SGE to control how many processes it assigns to a given node?
Thanks,
Paul.
Check out "allocation_rule" in the configuration for the parallel environment; either with that, or by specifying $pe_slots for allocation_rule and then using the -pe option to qsub, you should be able to do what you ask for above.
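For example, a parallel environment with a fixed allocation rule of 2 might look roughly like this (the PE name is hypothetical; view with qconf -sp and edit with qconf -mp):

```
pe_name            mpi_2per_node
slots              999
allocation_rule    2        # exactly 2 slots per execution host
control_slaves     TRUE
job_is_first_task  FALSE
```

Then submit with something like qsub -pe mpi_2per_node 60 job.sh, so the 60 slots are spread 2 per host across 30 hosts.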
You can do it by creating a queue that uses only 2 of the 12 processors on each node.
You can see the configuration of the current queue with the command:
qconf -sq queuename
You will see something like the following in the queue configuration. This example queue is configured to use 5 execution hosts with 4 slots (processors) each.
....
slots 1,[master=4],[slave1=4],[slave2=4],[slave3=4],[slave4=4]
....
Use the following command to change the queue configuration:
qconf -mq queuename
Then change those 4s to 2s.
From an admin host, run "qconf -msconf" to edit the scheduler configuration. It will bring up a list of configuration options in an editor. Look for the one called "load_formula". Set the value to "-slots" (without the quotes).
This tells the scheduler that a machine is least loaded when it has the fewest slots in use. If your exec hosts each have a similar number of slots, you will get an even distribution. If some exec hosts have more slots than the others, they will be preferred, but your distribution will still be more even than with the default load formula (which I don't remember, having changed it in my cluster quite some time ago).
You may need to set the slots on each host. I have done this myself because I need to limit the number of jobs on a particular set of boxes to less than their maximum, because they don't have as much memory as some of the other ones. I don't know if it is required for this load formula configuration, but if it is, you can add a slots consumable to each host. Do this with "qconf -me hostname", and add a value to the "complex_values" field that looks like "slots=16", where 16 is the number of slots you want that host to use.
This is what I learned from our sysadmin. Put this SGE resource request in your job script:
#$ -l nodes=30,ppn=2
This requests 30 nodes with 2 MPI processes per node (ppn). I think there is no guarantee that this 30x2 layout will work on a 30-node cluster if other users also run lots of jobs, but perhaps you can give it a try.