Reboot nodes from salt scheduler randomly - salt-stack

I have a task to reboot machines one by one, with, say, a 1-hour interval between the reboot of machine 1 and machine 2. There can be a reasonable number of machines involved, but more than just 1 or 2. The interval between reboots should be configurable, or even random, like a splay.
I've searched the documentation and could not find anything elegant for this task.
So far the best I have come up with is batch size 1, run every hour from cron, but maybe there is a better solution available?
The main problem is how to make sure only one machine is rebooting at any given moment.
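For reference, the batch-size-1 workaround I mention above boils down to something like this (just a sketch; the '*' target glob is a placeholder, and salt's batch mode on its own does not add a delay between minions):
# let salt walk through the targeted minions one at a time,
# e.g. fired from cron or run interactively
salt --batch-size 1 '*' system.reboot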

Related

Reducing replication on Openstack Swift

Is it possible to limit replication on OpenStack Swift to one or possibly two replicas? Three data replicas are currently being created.
The default replication in Swift is 3 replicas, but you can change that to any number greater than or equal to 1. The swift-ring-builder command has a set_replicas verb for adjusting the number of replicas; see
https://docs.openstack.org/swift/xena/admin/objectstorage-ringbuilder.html#replica-counts
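A minimal sketch of what that looks like on the command line (the builder file path is an assumption for your deployment):
swift-ring-builder /etc/swift/object.builder set_replicas 2
swift-ring-builder /etc/swift/object.builder rebalance
# then redistribute the updated ring file to all nodes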
Of course, there is a tradeoff between the number of replicas and your cluster's ability to cope with loss of disk drives and servers.
You also asked this:
If I don't want replication on OpenStack Swift (replication equal to 1), can I achieve this by disabling the replicators?
sudo swift-init object-replicator stop
If I may, is this a better option?
I guess it would probably work ... until someone or something restarts the replicators.
But it is not a good idea:
Monitoring may complain that the replicators are not working (!)
Other swift tools may complain that the cluster is out of spec; e.g. swift-recon --replication.
If you have been previously running with (say) a 3 replica policy, then nothing will delete the (now) redundant replicas created before you turned off the replicators.

Slowdown at increased number of processes for HPC with fat-tree architecture

I've noticed something particularly odd about a simple program I've been running on an HPC system with a fat-tree architecture, and I'm not exactly sure why I'm getting the results I'm getting.
The program simply measures and prints its own runtime for a varying number of MPI processes. I varied the number of processes in powers of two from 2 to 256; while the execution time tends to decrease as the number of processes grows from 2 to 8, it jumps dramatically at 64 processes.
Could this be because of the architecture itself? I'd expect the execution time to decrease with the number of processes, but that doesn't seem to be the case past a certain threshold.
I figured out the issue a while ago after reading the documentation (go figure) and wanted to post the solution here in case anyone had a similar issue. On the HPC I was using (AFRL's Mustang), I was executing my programs with mpirun on the login node. The documentation clearly states that jobs need to be submitted via a batch script per section 6 of the user guide:
https://www.afrl.hpc.mil/docs/mustangQuickStartGuide.html#jobSubmit
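For anyone hitting the same thing, a batch submission looks roughly like this (a sketch assuming a PBS-style scheduler; the queue name, node/core counts, walltime, and program name are placeholders for your allocation):
#!/bin/bash
#PBS -N mpi_timing
#PBS -q standard
#PBS -l select=2:ncpus=48:mpiprocs=32
#PBS -l walltime=00:30:00
cd $PBS_O_WORKDIR
# launch through the scheduler instead of running mpirun on the login node
mpirun -np 64 ./timing_program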

Autosys Machine Container

I am looking to set up a machine container in Autosys that looks like the example below:
Example_Example_MAIN
Example_Example_MAIN.Machine_Name1
Example_Example_MAIN.Machine_Name2
Example_Example_MAIN.Machine_Name3
Example_Example_MAIN.Machine_Name4
The way I am currently controlling these machines is to send 2, 3, and 4 Offline and leave 1 Online. Then, if 1 goes Offline, I send 2 Online and the batch runs on that machine.
Is it possible to leave all machines inside a container Online but specify a machine priority? For example, if I leave all machines Online, the batch would automatically target Machine_Name1, but if 1 goes Offline, the batch would automatically target machine 2, and so on.
Sorry if this is a silly question, I'm still only a beginner!
Thank you in advance!
Cameron.
Yes, you can place all of your machines in a single pool. Autosys will only send jobs to the machines in the pool that are Online.
For load balancing beyond that, you'll have to configure a Factor (how fast the machine is relative to your other machines) and Max_Load (how much work it can handle at once) for every machine in the pool, as well as set a Job_Load value on each of your jobs indicating how much capacity it consumes when running.
Refer to Chapter 3 of the Autosys user guide for the full details.
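A rough JIL sketch of those settings (the machine name, factor, load numbers, and command below are placeholders, not values from your setup):
/* machine definition with relative speed and capacity */
insert_machine: Machine_Name1
factor: 1.0
max_load: 100

/* job that consumes 25 load units when it runs */
insert_job: Example_Job
job_type: CMD
command: /path/to/batch.sh
machine: Example_Example_MAIN
job_load: 25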

IIS High Thread Count

I have an IIS application that behaves like this: the total thread count in the IIS worker processes starts out low; traffic comes in at some low rate, like 5 rpm; the thread count then starts increasing alarmingly, keeps growing even after the load stops, does not come down in a reasonable time, and reaches 30,000-plus threads, at which point response time goes for a toss.
machine.config is set to autoConfig.
There are no explicit threads in the application, though there is some "very fancy" use of Parallel.ForEach.
I'm looking for tips on how to go about diagnosing this. Reducing the use of Parallel.ForEach seemed to help, though I have yet to prove it conclusively. Limiting the maximum number of threads also caps the thread count, but I suspect something in the app is causing those threads to keep increasing, and that is what I want to solve.
The thread count I am tracking is ONLY for the IIS worker processes. The PUT requests are the only ones doing real work; the GETs are mostly requests for static resources.
Can this be reproduced in a local or dev environment? If so, it's a good time to attach to the process and use the debugging tools to see which threads are managed and where they are in the code. If that fails to unveil anything, then it might be time to capture a memory dump from the process and dig into it with WinDbg.
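As a rough sketch of that last step (the PID, dump path, and tool choice are assumptions; this uses Sysinternals ProcDump and the SOS extension in WinDbg):
rem capture a full memory dump of the IIS worker process (replace the PID)
procdump -ma <w3wp_pid> C:\dumps\w3wp.dmp
Then in WinDbg, with the dump loaded:
.loadby sos clr
!threads
~*e !clrstack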

How do I kill running map tasks on Amazon EMR?

I have a job running using Hadoop 0.20 on 32 spot instances. It has been running for 9 hours with no errors. It has processed 3800 tasks during that time, but I have noticed that just two tasks appear to be stuck and have been running alone for a couple of hours (apparently responding because they don't time out). The tasks don't typically take more than 15 minutes. I don't want to lose all the work that's already been done, because it costs me a lot of money. I would really just like to kill those two tasks and have Hadoop either reassign them or just count them as failed. Until they stop, I cannot get the reduce results from the other 3798 maps!
But I can't figure out how to do that. I have considered trying to figure out which instances are running the tasks and then terminate those instances, but
I don't know how to figure out which instances are the culprits
I am afraid it will have unintended effects.
How do I just kill individual map tasks?
Generally, on a Hadoop cluster you can kill a particular task by issuing:
hadoop job -kill-task [attempt_id]
This will kill the given map task and resubmit it on a different node with a new attempt id.
To get the attempt_id, navigate in the JobTracker's web UI to the map task in question, click on it, and note its id (e.g. attempt_201210111830_0012_m_000000_0).
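So with the example id above, the call would simply be:
hadoop job -kill-task attempt_201210111830_0012_m_000000_0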
SSH to the master node as mentioned by Lorand, and execute:
bin/hadoop job -list
bin/hadoop job -kill <JobID>
