salt-stack highstate - find slow states

Running an initial install takes about 20 minutes, and running a salt-call state.highstate takes about 6 minutes. That's not unreasonable, but I'd like to speed it up, and I'm not sure how to find the slowest states.
Is there any way to find how long each state takes to run other than watching my screen with a stopwatch for 6 minutes?

sudo salt-call state.highstate provides start-time and duration for each state.
----------
ID: ntp-removed
Function: pkg.removed
Result: True
Comment: None of the targeted packages are installed
Started: 12:45:04.430901
Duration: 0.955 ms
Changes:
You can capture this for processing:
salt-call state.highstate test=True --out json | tee output.json
python3 -c 'import json; j = json.load(open("output.json"))["local"]; print([x["name"] for x in j.values() if x["duration"] > 1000])'
['munin-node']
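To rank the slowest states rather than filter on a fixed threshold, a small jq sketch over the same output.json (assuming jq is installed) sorts by duration:
jq -r '.local | to_entries | sort_by(.value.duration) | reverse | .[0:10][]
       | "\(.value.duration) ms\t\(.value.name)"' output.json
This prints the ten longest-running states with their durations in milliseconds.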

Related

open screen session on many remote hosts executing complex command, don't exit afterward

I have a long list of remote hosts and I want to run a shell command on all of them. The command takes a very long time, so I want to run the command inside screen on the remote machine, disconnecting immediately from each, and I want the terminal output on the remote to be preserved after the command exits. There is a "tag" that should be supplied to each command as an argument. I tried to do this with parallel, something like this:
$ cat servers.txt
user1@server1.example.com/tag1
user2@server2.example.com/tag2
# ...
$ cat run.sh
grep -v '^#' servers.txt |
  parallel ssh -tt '{//}' \
    'tag={/}; exec screen slow_command --option1 --option2 $tag other args'
This doesn't work: all of the remote processes are launched, but they are not detached (so the ssh sessions remain live and I don't get my local shell back), and once each command finishes, its screen exits immediately and the output is lost.
How do I fix this script? Note: if this is easier to do with tmux and/or some other marshalling program besides parallel, I'm happy to hear answers that explain how to do it that way.
Something like this:
grep -v '^#' servers.txt |
parallel -q --colsep / ssh {1} "screen -d -m bash -c 'echo do stuff \"{2}\";sleep 1000000'"
The final sleep makes sure the screen does not die. You will have 1000000 seconds to attach to it and kill it.
There is an awful lot of quoting there - especially if do stuff is complex.
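On a remote host, attaching to and cleaning up a detached session afterwards is the usual screen workflow (a sketch; the session name is whatever screen assigned):
ssh user1@server1.example.com
screen -ls                 # list detached sessions
screen -r                  # reattach and read the preserved output (give the pid.tty.host name if there are several)
screen -S <pid.tty.host> -X quit    # end the session once you are done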
It may be easier to make a function that computes the tag on the remote machine. You need GNU Parallel 20200522 or later for this:
env_parallel --session
f() {
  sshlogin="$1"
  # TODO: given $sshlogin compute $tag (e.g. a table lookup)
  do_stuff() {
    echo "do stuff $tag"
    sleep 1000000
  }
  export -f do_stuff
  screen -d -m bash -c do_stuff "$@"
}
env_parallel --nonall --slf servers_without_tag f '$PARALLEL_SSHLOGIN'
env_parallel --end-session
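The servers_without_tag file is assumed here to be a plain sshlogin list, i.e. servers.txt without the /tag suffix; a hypothetical way to derive it:
# hypothetical helper: drop comment lines and strip the /tag suffix
grep -v '^#' servers.txt | sed 's,/.*$,,' > servers_without_tag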

How to set state_output=changes in a salt-stack schedule?

I have a salt schedule calling state.apply and using the highstate returner to write out a file. The schedule is being kicked off as expected, and the output file is being created, but all the unchanged states are included in the output.
On the command line, I'd force only diffs and errors to be shown with the --state_output=changes option of salt.
Is there a way to set state_output=changes in the schedule somehow?
I'm defining the schedule in the pillar data, and it looks something like this:
schedule:
  mysched:
    function: state.apply
    seconds: 3600
    kwargs:
      test: True
    returner: highstate
    returner_kwargs:
      report_format: yaml
      report_delivery: file
      file_output: /path/to/mysched.yaml
I fixed this by switching the schedule as per below. Instead of calling state.apply directly, the schedule uses cmd.run to kick off a salt-call command that does the state.apply, and that command can include the --state-output flag.
schedule:
  mysched:
    function: cmd.run
    args:
      - "salt-call state.apply --state-output=changes --log-level=warning test=True > /path/to/mysched.out 2>&1"
    seconds: 3600
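To confirm the minion has picked up the redefined job, the standard saltutil and schedule execution modules can be queried on the minion (a quick sanity check, not part of the fix):
salt-call saltutil.refresh_pillar
salt-call schedule.list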

Saltstack Packages Failed to Install on OpenBSD 5.8

I am kinda new to Saltstack so I may need some hand holding, but here it goes.
First some background info:
I am running a salt-master server on a CentOS 6.7 VM.
I am running a salt-minion on an OpenBSD 5.8 machine.
I have accepted the keys from the minion on the master and I am able to test.ping from the master to the minion. So the connection is fine.
I have created a bunch of .sls files for all of the packages I want to install under a directory called OpenBSD.
As an example, here is my bash/init.sls file:
bash:
  pkg:
    - installed
Very simple, right?
Now I run the command: # salt 'machinename' state.sls OpenBSD/bash
However, this is what the salt master responds with:
Machinename:
----------
ID: bash
Function: pkg.installed
Result: False
Comment: The following packages failed to install/update: bash
Started: 19:03:50.191735
Duration: 1342.497 ms
Changes:
Summary
------------
Succeeded: 0
Failed: 1
------------
What am I doing wrong?
Can you run it with the -l debug option and see if there is anything useful in the output? Also, can you run the following on the BSD box itself and paste any useful output:
salt-call -ldebug state.sls OpenBSD.bash
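Independently of the debug output, it is worth checking that pkg_add works outside of Salt, since the OpenBSD pkg module shells out to it and a missing package mirror is a common cause of this failure. A hedged check on the minion (the mirror URL is only an example):
# point pkg_add at a 5.8 mirror and do a dry run; -n only simulates the install
export PKG_PATH="http://ftp.openbsd.org/pub/OpenBSD/5.8/packages/$(uname -m)/"
pkg_add -n bash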

Kill all R processes that hang for longer than a minute

I use a cron task to regularly run Rscript. Unfortunately, I need to do this on a small AWS instance, and the process may hang, piling up more and more processes until the whole system is lagging.
I would like to write a cron task to kill all R processes that have been running for longer than one minute. I found another answer on Stack Overflow that I've adapted, which I think should solve the problem. I came up with:
if [[ "$(uname)" = "Linux" ]];then killall --older-than 1m "/usr/lib/R/bin/exec/R --slave --no-restore --file=/home/ubuntu/script.R";fi
I copied the command line directly from htop, but it does not work as I expect. I get a "No such file or directory" error, but I've checked it a few times.
I need to kill all R processes that have lasted longer than a minute. How can I do this?
You may want to avoid killing processes from another user and try SIGKILL (kill -9) after SIGTERM (kill -15). Here is a script you could execute every minute with a CRON job:
#!/bin/bash
PROCESS="R"
MAXTIME=`date -d '00:01:00' +'%s'`

function killpids()
{
    PIDS=`pgrep -u "${USER}" -x "${PROCESS}"`

    # Loop over all matching PIDs
    for pid in ${PIDS}; do
        # Retrieve duration of the process
        TIME=`ps -o time:1= -p "${pid}" |
              egrep -o "[0-9]{0,2}:?[0-9]{0,2}:[0-9]{2}$"`
        # Convert TIME to timestamp
        TTIME=`date -d "${TIME}" +'%s'`
        # Check if the process should be killed
        if [ "${TTIME}" -gt "${MAXTIME}" ]; then
            kill ${1} "${pid}"
        fi
    done
}

# Leave a chance to kill processes properly (SIGTERM)
killpids "-15"
sleep 5
# Now kill remaining processes (SIGKILL)
killpids "-9"
Why spawn an additional process every minute with cron?
Would it not be easier to start R with timeout from coreutils? The processes will then be killed automatically after the time you chose.
timeout [option] duration command [arg]…
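Concretely, the command in the existing cron entry could simply be wrapped in timeout; a sketch reusing the script path from the question (the 60-second limit is an assumption, and -k 10 escalates to SIGKILL 10 seconds later):
timeout -k 10 60 Rscript /home/ubuntu/script.R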
I think the best option is to do this with R itself. I am no expert, but it seems the future package will allow executing a function in a separate thread. You could run the actual task in a separate thread, and in the main thread sleep for 60 seconds and then stop().
Previous Update
user1747036's answer, which recommends timeout, is a better alternative.
My original answer
This question is more appropriate for superuser, but here are a few things wrong with
if [[ "$(uname)" = "Linux" ]];then
killall --older-than 1m \
"/usr/lib/R/bin/exec/R --slave --no-restore --file=/home/ubuntu/script.R";
fi
The name argument is either the name of the process image or the path to it. You have included the command's arguments as well.
If -s signal is not specified, killall sends SIGTERM, which your process may ignore. Are you able to kill a long-running script with it on the command line? You may need SIGKILL / -9.
More at http://linux.die.net/man/1/killall
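Putting those two points together, a hedged correction of the original command (the ubuntu user is assumed from the /home/ubuntu path):
# match on the bare process name, restrict to one user, escalate to SIGKILL
killall --older-than 1m --user ubuntu R
sleep 5
killall --older-than 1m --user ubuntu -s KILL R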

Wget Hanging, Script Stops

Evening,
I am running a lot of wget commands using xargs
cat urls.txt | xargs -n 1 -P 10 wget -q -t 2 --timeout 10 --dns-timeout 10 --connect-timeout 10 --read-timeout 20
However, once the file has been parsed, some of the wget instances 'hang.' I can still see them in system monitor, and it can take about 2 minutes for them all to complete.
Is there any way I can specify that the instance should be killed after 10 seconds? I can re-download all the URLs that failed later.
In system monitor, the wget instances are shown as sk_wait_data when they hang. xargs is there as 'do_wait,' but wget seems to be the issue, as once I kill them, my script continues.
I believe this should do it:
wget -v -t 2 --timeout 10
According to the docs:
--timeout: Set the network timeout to seconds seconds. This is equivalent to specifying
--dns-timeout, --connect-timeout, and --read-timeout, all at the same time.
Check the verbose output too and see more of what it's doing.
Also, you can try:
timeout 10 wget -v -t 2
Or you can do what timeout does internally:
( cmdpid=$BASHPID; (sleep 10; kill $cmdpid) & exec wget -v -t 2 )
(As seen in: BASH FAQ entry #68: "How do I run a command, and have it abort (timeout) after N seconds?")
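The same wrapper also fits the original xargs pipeline, so that each wget is killed outright if it hangs past a hard limit (the 30-second cap is an arbitrary assumption; the asker's flags are kept):
xargs -n 1 -P 10 timeout 30 wget -q -t 2 --timeout 10 --dns-timeout 10 --connect-timeout 10 --read-timeout 20 < urls.txt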
GNU Parallel can download in parallel, and retry the process after a timeout:
cat urls.txt | parallel -j10 --timeout 10 --retries 3 wget -q -t 2
If the time it takes to fetch a URL changes (e.g. due to a faster internet connection), you can let GNU Parallel figure out the timeout:
cat urls.txt | parallel -j10 --timeout 1000% --retries 3 wget -q -t 2
This will make GNU Parallel record the median time for a successful job and set the timeout dynamically to 10 times that.
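To also collect the URLs that still failed so they can be re-downloaded later, GNU Parallel's job log can record them (a sketch; wget.log is an arbitrary file name):
cat urls.txt | parallel -j10 --timeout 10 --retries 3 --joblog wget.log wget -q -t 2
# later, rerun only the jobs whose exit code was non-zero, as recorded in the joblog
parallel --joblog wget.log --retry-failed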
