I'm leveraging Zabbix with a custom low-level discovery that queries a REST API endpoint using Python. When the polling is on, the CPU utilization goes through the roof. All the CPU usage is caused by setroubleshootd, as shown in top:
top - 13:51:56 up 15:33, 1 user, load average: 1.52, 1.43, 1.37
Tasks: 127 total, 3 running, 124 sleeping, 0 stopped, 0 zombie
%Cpu(s): 35.8 us, 6.7 sy, 0.0 ni, 57.3 id, 0.1 wa, 0.0 hi, 0.2 si, 0.0 st
KiB Mem : 8010508 total, 6211020 free, 397104 used, 1402384 buff/cache
KiB Swap: 1679356 total, 1679356 free, 0 used. 6852016 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7986 setroub+ 20 0 424072 130856 11548 R 77.4 1.6 7:12.16
Zabbix calls the agent and requests that it execute a "UserParameter", which is shorthand for a script. That script is a bash file that calls my Python script, and the call looks like this:
#!/usr/bin/env bash
/usr/bin/python /etc/zabbix/externalscripts/discovery.py $1 $2 $3 $4 $5
When Zabbix calls the script, it passes the unique filters, such as a server ID or network card ID, as arguments. The Python script opens an HTTPS session using requests, leveraging a bearer token if the token file exists. If the token file doesn't exist, it creates it.
The script works fine and does everything it is supposed to, but setroubleshoot is reporting a slew of issues, specifically around file/folder access. The huge number of setroubleshootd alerts is causing the CPU to go nuts. Here is an example of the error:
python: SELinux is preventing /usr/bin/python2.7 from create access on the file 7WMXFl.
The file name is random and changes with every execution. I've tried adding an exception using the SELinux tools, such as:
ausearch -c 'python' --raw | audit2allow -M my-python
But since the file name is random, the errors persist. I've tried uninstalling setroubleshoot, but it just gets reinstalled. Unfortunately, I need to run in enforcing mode, so dropping to permissive or disabling SELinux are not options.
I've also tried skipping the bash script and having Zabbix call the Python script directly (declaring a /usr/bin/python shebang), but passing arguments doesn't seem to work properly: I get an error stating that $1 $2... are unknown arguments.
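For reference, a flexible UserParameter normally has to declare the key with [*] so the agent substitutes $1..$9 into the command; a rough sketch of what the agent config line might look like (the key name custom.discovery and the config file name are made up):
# e.g. in /etc/zabbix/zabbix_agentd.d/discovery.conf (hypothetical file)
# [*] makes the key accept parameters; the agent replaces $1..$5 before running the command
UserParameter=custom.discovery[*],/usr/bin/python /etc/zabbix/externalscripts/discovery.py $1 $2 $3 $4 $5
The item on the server side would then call the key as custom.discovery[<server ID>,<nic ID>].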
I'm at a loss at this point. It is running, but I'd really like to get the CPU usage down, as 60% of 4 cores is unreasonable for 30-40 HTTPS calls.
I ended up writing an SELinux module for this that allows the zabbix user write access to the /tmp folder where these files are being created and managed. CPU usage dropped from 75% to 2%. #NailedIt
$ sudo ausearch -m avc | grep zabbix | grep denied | audit2allow -m zabbixallow > my_script.te
$ checkmodule -M -m -o zabbixallow.mod my_script.te
$ semodule_package -o zabbixallow.pp -m zabbixallow.mod
$ sudo semodule -i zabbixallow.pp
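As a quick sanity check (assuming the module name above), you can confirm the module actually loaded and that the denials have stopped:
$ sudo semodule -l | grep zabbixallow
$ sudo ausearch -m avc -ts recent | grep zabbix | grep denied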
Hopefully this helps someone else if they run across this issue.
External scripts have to complete within your timeout value, and this one sounds too heavy for that. You could convert it to zabbix_sender and schedule it via cron; then it's just a script with performance problems.
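A rough sketch of what that could look like, assuming a trapper-type item and made-up server name and item key; the wrapper would be run from cron instead of being called as a UserParameter:
#!/usr/bin/env bash
# hypothetical wrapper, e.g. cron entry: */5 * * * * /etc/zabbix/externalscripts/send_discovery.sh srv01 nic0
# run the discovery script once and push its output to the server as a single trapper value
DATA=$(/usr/bin/python /etc/zabbix/externalscripts/discovery.py "$@")
zabbix_sender -z zabbix.example.com -s "$(hostname)" -k custom.discovery -o "$DATA"
The agent timeout then no longer applies, since the agent is out of the picture for this item.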
I have a small Meteor JS app and suddenly it started using 100% CPU. I found some blogs saying that maybe the oplog is causing the high CPU usage, so I disabled it using:
meteor add disable-oplog
but it did not change anything. I'm facing this issue in the development environment (running the app through the meteor command) and in the deployment environment (running the app remotely using mup).
Development environment: Ubuntu 14.0, 64-bit, 2 GB RAM, Meteor 1.3, Node.js 0.10.45.
Deployment environment (droplet): Ubuntu 14.0, 64-bit, 512 MB RAM, Meteor 1.3, Node.js 0.10.45.
(Installed packages and monitoring-process output omitted.)
I've run into this problem before, but only when running too many production Meteor development environments on one server for too long.
In my case it came down to the swap configuration. Meteor apps can use a lot of memory, and 512 MB can be too little; the server was swapping all the time, which oddly showed up as a CPU spike. Once I put a better swap configuration in place, all was fine.
This was on an Ubuntu server (I can't recall if it was 14 or 16) on DigitalOcean hosting; they have swap disabled by default, and the first swap setup I put in place was apparently bad.
It may not be likely this is the answer for you, but I'm writing it up as it's certainly possible, and can be very hard to figure out.
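For reference, a minimal sketch of the kind of swap setup I mean on a small droplet (the 1 GB size and the swappiness value are just examples):
# create and enable a 1 GB swap file
sudo fallocate -l 1G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# make it persistent across reboots and make the kernel less eager to swap
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
sudo sysctl vm.swappiness=10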
Maybe you can try using a CPU limiter; here's a bash script I created:
https://gist.github.com/cortezcristian/5ab4fdddcc573972d44873f1e97a2b88
You'll need to install cpulimit first:
sudo apt-get install cpulimit
ps ax | grep node | grep meteor | grep -v grep | awk '{print $1}' > /tmp/my-app.pid
cpulimit --pid=$(cat /tmp/my-app.pid) --limit=77
After that you can choose the limit you want (e.g. 50 or 100, as a CPU percentage) with the --limit flag.
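If you'd rather not deal with the PID file, cpulimit can also match on the executable name; a minimal sketch (node is just the obvious guess for a Meteor app, and 50 is an arbitrary limit):
# limit the process whose executable is "node" to ~50% of one core, keeping cpulimit running in the background
cpulimit -e node -l 50 &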
I am trying to run a compiled program that is supposed to run on multiple processors. But with the same data, sometimes this program runs in parallel and sometimes it doesn't (with an identical PBS script file!). I suspect that something is wrong with some of the compute nodes that won't let it run in parallel (I don't get to choose the compute node I want). How can I troubleshoot whether this is a bug in the program or a problem with the compute node?
As per the sysadmin's advice, I am using ulimit -s 100000, but this doesn't change anything. Also, this program is not an MPI program (it runs only on a single node, with multiple processors).
The code that I run is as follows:
quorum_error_correct_reads -q 68 \
--contaminant=/data004/software/GIF/packages/masurca/2.3.0rc1/bin/../share/adapter.jf \
-m 1 -s 1 -g 1 -a 3 --thread=32 -w 10 -e 3 \
quorum_mer_db.jf aa.renamed.fastq ab.renamed.fastq ac.renamed.fastq ad.renamed.fastq ae.renamed.fastq af.renamed.fastq ag.renamed.fastq \
--no-discard -o pe.cor --verbose
Thanks for any advice you can offer. I will greatly appreciate your help!
PS: I don't have sudo access.
EDIT: I know it is supposed to be using multiple processors because, when I SSH into the node and run top -c, I can see the above command sometimes running at around 3200% CPU the whole time and sometimes at only 100% CPU the whole time. This is the only step involved and there are no other sub-processes within this program. Also, I am using an HPC cluster, where I submit the job to a compute node, each with 32 processors and 512 GB RAM.
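One way to narrow down whether it's the node or the program is to log, from inside the job itself, which node it landed on and how many CPUs it was actually allowed to use, then compare a fast run against a slow run. A hedged sketch of such a PBS preamble (directive syntax and resource names may differ on your scheduler):
#PBS -l nodes=1:ppn=32
#PBS -j oe
# record where the job ran and what it could see, so fast and slow runs can be compared
echo "node: $(hostname)"
echo "cpus visible: $(nproc)"
taskset -cp $$        # CPU affinity mask the batch system gave this job
ulimit -a             # confirm the stack and other limits actually applied
# ...then the quorum_error_correct_reads command exactly as above...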
On my hosting account I run a chat app in Node.js. All works fine; however, my host times out processes every 12 hours. Apparently a daemonized process will not be timed out, so I tried to daemonize it:
Using forever.js: running forever start chat.js. Running forever list confirms it runs, and the ps -ef command shows ? in the TTY column.
Tried nohup node chat.js: in ps -ef the TTY column shows pts/0 and the PPID is 1.
I tried to disconnect stdin, stdout, and stderr, and make it ignore the hangup signal (SIGHUP), so nohup ./myscript 0<&- &> my.admin.log.file &, with no luck. The ps -ef TTY column is pts/0 and the PPID is anything but 1.
I tried (nohup ./myscript 0<&- &>my.admin.log.file &) with no luck again. The ps -ef TTY column is pts/0 and the PPID is 1.
After all this, the process still times out in about 12 hours.
Now I tried (nohup ./myscript 0<&- &>my.admin.log.file &) & and am waiting, but I'm not keeping my hopes up and need someone's help.
The hosting guys claim that daemon processes do not time out, but how can I make sure my process is a daemon? Nothing I tried seems to work, even though, with my limited understanding, ps -ef seems to suggest the process is daemonized.
What shall I do to daemonize the process without moving to much more expensive hosting plans? Can I argue with the host that after all this the process is a daemon and they just got it wrong somewhere?
Upstart is a really easy way to daemonize processes
http://upstart.ubuntu.com/
There's some info on using it with node and monit, which will restart Node for you if it crashes
http://howtonode.org/deploying-node-upstart-monit
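For what it's worth, a minimal sketch of an Upstart job for a chat app like this (the paths and file name are made up, and writing to /etc/init needs root, which a shared host may not allow):
# /etc/init/chat.conf (hypothetical)
description "node.js chat daemon"
start on runlevel [2345]
stop on runlevel [016]
# restart the process automatically if it dies
respawn
exec /usr/bin/node /path/to/chat.js >> /var/log/chat.log 2>&1
Then sudo start chat brings it up, and Upstart keeps it detached from any terminal.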
I'm trying to plot the TCP congestion window and the slow start threshold using iperf and the tcp_probe module. I do exactly what is described here:
to obtain the data:
modprobe tcp_probe port=5001
chmod 444 /proc/net/tcpprobe
cat /proc/net/tcpprobe >/tmp/tcpprobe.out &
TCPCAP=$!
iperf -i 10 -t 100 -c receiver
kill $TCPCAP
Oops!
/tmp/tcpprobe.out is empty :(
This is Ubuntu 11.04 x86, and I already tried the same on Ubuntu 11.04 x64.
Any suggestions?
I was having the same problem. What worked for me was:
modprobe -r tcp_probe
sudo modprobe tcp_probe port=5002 full=1
sudo chmod 444 /proc/net/tcpprobe
cat /proc/net/tcpprobe > /tmp/tcpprobe.out &
TCPCAP=$!
iperf -c <servers IP address here> -p 5002 -t 100 -i 1
sudo kill $TCPCAP
See the iperf parameters to check whether those (-t 100 -i 1) are what you need by typing:
man iperf
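Once tcpprobe.out actually has data in it, a rough way to plot the two values (on the kernels I've looked at, snd_cwnd is column 7 and ssthresh column 8 of the tcp_probe output, but check your file first):
# plot cwnd and ssthresh over time from the capture file
gnuplot -persist -e 'plot "/tmp/tcpprobe.out" using 1:7 title "cwnd" with lines, "/tmp/tcpprobe.out" using 1:8 title "ssthresh" with lines'
Note that ssthresh starts out as a huge sentinel value, so you may want to set a yrange or filter out the first samples.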
I/O functions in the C standard library are buffered by default, usually 4 KB, so fread() only returns when the buffer is full or at EOF. You can use a small buffer, e.g. 128 bytes:
dd if=/proc/net/tcpprobe ibs=128 obs=128
Now the messages flush quickly.
By default tcp_probe logs only when the cwnd changes; try modprobe tcp_probe ... full=1.
Linux source code reference: http://www.cs.fsu.edu/~baker/devices/lxr/http/source/linux/net/ipv4/tcp_probe.c#L47
I had a similar issue: the tcp_probe module outputs only at non-obvious time intervals. I've created a modified version of it that flushes on every received TCP segment. This slows down the system, but allows better monitoring of short-lived connections like HTTP.
Find the source code to the module here.
Another issue which can cause no output is the file permissions of the output file tcpprobe.out.
When I cat /proc/net/tcpprobe directly, I can see the output, but when redirecting the output to the file, the output file size is 0, which reminded me that it's a permission issue...
A very late answer, but have been struggling with this issue myself. I was trying out the version Dyna provided, yet still got no output, regardless of the parameters used. In the end, I found that the order was the problem.
The way I was using tcp_probe was: install/activate the module, run some TCP application (I was running some TCP unit tests), then start the copy process for /proc/net/tcpprobe (as shown in the other answers), and then remove/stop the module. The correct way is to start the copy process BEFORE you perform the TCP-intensive activity, and not kill it in between. Keep the cat process running while you perform the TCP activity and only kill it afterwards.
A pretty humbling experience for me, as it took hours to figure this out. Hopefully, people find this useful.
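In other words, the sequence that worked for me looks roughly like this (port and file names as in the answers above):
sudo modprobe tcp_probe port=5001 full=1
sudo chmod 444 /proc/net/tcpprobe
cat /proc/net/tcpprobe > /tmp/tcpprobe.out &   # start the reader FIRST
TCPCAP=$!
# ...now run the TCP-intensive activity (iperf, unit tests, ...)...
sudo kill $TCPCAP                              # stop the reader only afterwards
sudo modprobe -r tcp_probe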
In UNIX, I have a utility, say 'Test_Ex', a binary file. How can I write a job or a shell script (as a cron job), always running in the background, which checks every 5 seconds whether 'Test_Ex' is still running (and probably hide this job)? If it is running, do nothing. If not, delete a directory at the specified path.
Try this script:
pgrep Test_Ex > /dev/null || rm -r dir
If you don't have pgrep, use
ps -e -ocomm | grep -q Test_Ex || ...
instead.
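Since cron can't schedule anything more often than once a minute, the 5-second check is easier to do as a small watcher loop; a minimal sketch, with the directory path as a placeholder:
#!/usr/bin/env bash
# hypothetical watcher.sh - start it in the background once, e.g. with: nohup ./watcher.sh &
while true; do
    # -x matches the exact process name; if Test_Ex is gone, remove the directory and stop
    if ! pgrep -x Test_Ex > /dev/null; then
        rm -rf /path/to/dir
        break
    fi
    sleep 5
done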
Utilities like upstart, originally part of the Ubuntu Linux distribution I believe, are good for monitoring running tasks.
The best way to do this is to not do it. If you want to know if Test_Ex is still running, then start it from a script that looks something like:
#!/bin/sh
Test_Ex
logger "Test_Ex died"
rm /p/a/t/h
or
#!/bin/sh
while ! Test_Ex
do
logger "Test_Ex terminated unsuccesfully, restarting in 5 seconds"
sleep 5
done
Querying ps regularly is a bad idea, and trying to monitor it from cron is a horrible, horrible idea. There seems to be some comfort in the idea that crond will always be running, but you can no more rely on that than you can rely on the wrapper script staying alive; either one can be killed at any time. Waking up every few seconds to query ps is just a waste of resources.