Linux bridge - hello and hold timers - networking

I am working on a small embedded Linux bridge device (2.6 tickless kernel).
I am trying to reduce CPU wakeups while the system is idle, so I am checking which timers can be avoided.
I have noticed that although STP is explicitly disabled on the bridge, the hello and hold timers still expire every 2 seconds (the default interval).
echo 1 > /proc/timer_stats; sleep 30; echo 0 > /proc/timer_stats ; cat /proc/timer_stats
Timer Stats Version: v0.2
Sample period: 30.016 s
1, 1 init hrtimer_start_range_ns (hrtimer_wakeup)
15, 0 swapper br_transmit_config (br_hold_timer_expired)
15, 0 swapper run_timer_softirq (br_hello_timer_expired)
15, 5 events/0 worker_thread (delayed_work_timer_fn)
2D, 5 events/0 neigh_periodic_work (delayed_work_timer_fn)
1, 502 sleep hrtimer_start_range_ns (hrtimer_wakeup)
5, 0 swapper run_timer_softirq (sync_supers_timer_fn)
5, 10 bdi-default bdi_forker_task (process_timeout)
2D, 5 events/0 neigh_periodic_work (delayed_work_timer_fn)
1, 5 events/0 worker_thread (delayed_work_timer_fn)
1, 548 sleep hrtimer_start_range_ns (hrtimer_wakeup)
63 total events, 2.098 events/sec
Looking at the kernel bridge code, it seems that these timers are set regardless of whether STP is enabled or disabled.
One approach is to lengthen the hello interval (brctl sethello br0 30). That works, but it is not ideal.
Another approach is to patch the kernel so that the timers are not initialized when STP is disabled, but patching the kernel is not ideal either.
Is there a reason these timers are initialized even though STP is disabled?
Does anyone have a different idea or approach?
Thanks
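To compare runs before and after a change such as brctl sethello, it can help to total the wakeups per expiry callback. The following is only a rough sketch: it assumes the entry layout visible in the output above ("count[D], pid comm start_function (expiry_function)") and that the command name contains no spaces; the function name wakeups_by_callback is made up for illustration.

```python
import re
from collections import Counter

# One entry line: "<count>[D], <pid> <comm> <start_func> (<expire_func>)".
# Assumption: <comm> contains no whitespace, as in the sample output.
ENTRY = re.compile(r"^\s*(\d+)(D?),\s+(\d+)\s+(\S+)\s+(\S+)\s+\((\S+)\)\s*$")

def wakeups_by_callback(text):
    """Sum event counts per expiry callback, skipping header and summary lines."""
    totals = Counter()
    for line in text.splitlines():
        match = ENTRY.match(line)
        if match:
            totals[match.group(6)] += int(match.group(1))
    return totals

# A few lines copied from the output in the question.
SAMPLE = """\
Timer Stats Version: v0.2
Sample period: 30.016 s
15, 0 swapper br_transmit_config (br_hold_timer_expired)
15, 0 swapper run_timer_softirq (br_hello_timer_expired)
2D, 5 events/0 neigh_periodic_work (delayed_work_timer_fn)
1, 5 events/0 worker_thread (delayed_work_timer_fn)
63 total events, 2.098 events/sec
"""

if __name__ == "__main__":
    for callback, count in wakeups_by_callback(SAMPLE).most_common():
        print(f"{count:4d}  {callback}")
```

Feeding it the full contents of /proc/timer_stats instead of SAMPLE gives one line per callback, which makes the two bridge timers easy to spot.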

Related

ZeroWindow errors and netstat statistics?

I have been told that one of my servers intermittently throws ZeroWindow errors. I would like to monitor this in Prometheus.
If I run netstat -s, some of the results are:
netstat -s
Ip:
...
IcmpMsg:
...
Tcp:
...
TcpExt:
TCPFromZeroWindowAdv: 96
TCPToZeroWindowAdv: 96
TCPWantZeroWindowAdv: 16
It is very difficult to find a definition for these counters. The closest that I have found is:
WantZeroWindowAdv: +1 each time window size of a sock is 0
ToZeroWindowAdv: +1 each time window size of a sock dropped to 0
FromZeroWindowAdv: +1 each time window size of a sock increased from 0
Reading this, I believe that WantZeroWindowAdv shows the ZeroWindow problems. (It counts each time a socket is asked for its window size and responds with 0.)
Not part of the question: I would then need to add this to nodes_netstat.go for Prometheus.
Am I correct? Is this approach valid? Netstat is not well documented.
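For monitoring, the three counters quoted above can be pulled out without running netstat. This is a minimal sketch, assuming the values come from /proc/net/netstat on Linux, where each protocol has a line of counter names followed by a line of values; the function names parse_netstat and zero_window_counters are hypothetical.

```python
def parse_netstat(text):
    """Parse /proc/net/netstat-style text into {protocol: {counter: value}}."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    stats = {}
    # Entries come in pairs: a line of counter names, then a line of values.
    for names, values in zip(lines[0::2], lines[1::2]):
        protocol = names[0].rstrip(":")
        stats[protocol] = dict(zip(names[1:], map(int, values[1:])))
    return stats

def zero_window_counters(text):
    """Return only the TcpExt counters related to zero-window advertisements."""
    tcpext = parse_netstat(text).get("TcpExt", {})
    return {name: value for name, value in tcpext.items() if "ZeroWindowAdv" in name}

# Sample built from the values shown in the question.
SAMPLE = """\
TcpExt: TCPFromZeroWindowAdv TCPToZeroWindowAdv TCPWantZeroWindowAdv
TcpExt: 96 96 16
"""

if __name__ == "__main__":
    print(zero_window_counters(SAMPLE))
```

On a live system the same function could be given open("/proc/net/netstat").read() instead of SAMPLE.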
Your descriptions of "To" and "From" are correct.
"Want" is when TCP would have liked to have sent a zero window back to a sender, but couldn't because that would have implied a shrinking of the window rather than it being full.

R and GNU Parallel - How to limit number of cores used

(New to GNU Parallel)
My aim is to run the same Rscript, with different arguments, over multiple cores. My first problem is to get this working on my laptop (2 real cores, 4 virtual), then I will port this over to one with 64 cores.
Currently:
I have an R script, "Test.R", which takes in arguments, does a thing (say, adds some numbers and writes the result to a file), then stops.
I have a "commands.txt" file containing the following:
/Users/name/anaconda3/lib/R/bin/Rscript Test.R 5 100 100
/Users/name/anaconda3/lib/R/bin/Rscript Test.R 5 100 1000
/Users/name/anaconda3/lib/R/bin/Rscript Test.R 5 100 1000
/Users/name/anaconda3/lib/R/bin/Rscript Test.R 5 100 1000
/Users/name/anaconda3/lib/R/bin/Rscript Test.R 50 100 1000
/Users/name/anaconda3/lib/R/bin/Rscript Test.R 50 200 1000
So this tells GNU Parallel to run Test.R using R (which I installed using Anaconda).
In the terminal (after navigating to the desktop which is where Test.R and commands.txt are) I use the command:
parallel --jobs 2 < commands.txt
What I want this to do is use 2 cores and run the commands from commands.txt until all tasks are complete. (I have tried variations on this command, such as changing the 2 to a 1; in that case, 2 of the cores run at 100% and the other 2 run at around 20-30%.)
When I run this, all 4 cores go to 100% (as seen in htop), the first 2 jobs complete, and then no more jobs complete, despite all 4 cores still being at 100%.
When I run the same command on the 64-core machine, all 64 cores go to 100%, and I have to cancel the jobs.
Any advice on resources to look at, or what I am doing wrong would be greatly appreciated.
Bit of a long question, let me know if I can clarify anything.
The output from htop, as requested, while running the above command (sorted by CPU%):
1 [||||||||||||||||||||||||100.0%] Tasks: 490, 490 thr; 4 running
2 [|||||||||||||||||||||||||99.3%] Load average: 4.24 3.46 4.12
3 [||||||||||||||||||||||||100.0%] Uptime: 1 day, 18:56:02
4 [||||||||||||||||||||||||100.0%]
Mem[|||||||||||||||||||5.83G/8.00G]
Swp[|||||||||| 678M/2.00G]
PID USER PRI NI VIRT RES S CPU% MEM% TIME+ Command
9719 user 16 0 4763M 291M ? 182. 3.6 0:19.74 /Users/user/anaconda3
9711 user 16 0 4763M 294M ? 182. 3.6 0:20.69 /Users/user/anaconda3
7575 user 24 0 4446M 94240 ? 11.7 1.1 1:52.76 /Applications/Utilities
8833 user 17 0 86.0G 259M ? 0.8 3.2 1:33.25 /System/Library/StagedF
9709 user 24 0 4195M 2664 R 0.2 0.0 0:00.12 htop
9676 user 24 0 4197M 14496 ? 0.0 0.2 0:00.13 perl /usr/local/bin/par
Based on the output from htop the script /Users/name/anaconda3/lib/R/bin/Rscript uses more than one CPU thread (182%). You have 4 CPU threads and since you run 2 Rscripts we cannot tell if Rscript would eat all 4 CPU threads if it ran by itself. Maybe it will eat all CPU threads that are available (your test on the 64 core machine suggests this).
If you are using GNU/Linux you can limit which CPU threads a program can use with taskset:
taskset 9 parallel --jobs 2 < commands.txt
This should force GNU Parallel (and all its children) to use only CPU threads 1 and 4 (9 in binary: 1001), so the two jobs are limited to two threads.
By using 9 (1001 binary) or 6 (0110 binary) we are reasonably sure that the two CPU threads are on two different cores. 3 (11 binary) might refer to the two threads on the same CPU core and would therefore probably be slower. The same goes for 5 (101 binary).
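The mask arithmetic above can be checked with a small sketch. The helper names mask_to_cpus and cpus_to_mask are made up for illustration, and the indices are zero-based, so "threads 1 and 4" in the text correspond to indices 0 and 3 here.

```python
def mask_to_cpus(mask):
    """Return the zero-based CPU indices whose bits are set in an affinity mask."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]

def cpus_to_mask(cpus):
    """Build an affinity mask with one bit set per zero-based CPU index."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask

if __name__ == "__main__":
    # The masks discussed in the answer.
    for mask in (0x9, 0x6, 0x3, 0x5):
        print(f"mask {mask:#x} = {mask:04b} binary -> CPUs {mask_to_cpus(mask)}")
```

Note that taskset reads a bare mask argument as hexadecimal, so masks above 9 should be written in hex (for example, f rather than 15 for four CPUs).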
In general you want to use as many CPU threads as possible as that will typically make the computation faster. It is unclear from your question why you want to avoid this.
If you are sharing the server with others a better solution is to use nice. This way you can use all the CPU power that others are not using.

MariaDB uses almost 100% CPU usage when innodb-encryption-threads > 1

Our MariaDB server suddenly started to use all available CPU on our encrypted database.
It doesn't seem to have an impact on performance though.
It seems to be a thread locking issue; setting innodb-encryption-threads from 4 to 1 fixes it.
The load corresponds to 4 threads using all the CPU (50% each on a dual core).
An strace on one of the offending threads floods with this:
futex(0x561733657bc4, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x561733657b60, FUTEX_WAIT_PRIVATE, 2, NULL) = 0
futex(0x561733657b60, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fb4ede24af0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1520416318, tv_nsec=32097000}, 0xffffffff) = 0
futex(0x7fb4ede24a90, FUTEX_WAIT_PRIVATE, 2, NULL) = 0
futex(0x7fb4ede24a90, FUTEX_WAKE_PRIVATE, 1) = 0
sched_yield()
What is causing this and how can we fix it?
When a server resource (CPU, I/O, etc.) is saturated and you control how many threads, processes, or requests are demanding it, it is often better to throttle the number of threads back.
In your example, you have 4 threads hitting 2 cores. As a result, each thread takes twice (4/2) as long as it could to finish its task (encrypting something).
That is, 4 things (InnoDB blocks?) are each locked for twice as long, versus 2 things being locked for less time.

OpenMP and MPI hybrid program

I have a machine with 8 processors. I want to alternate using OpenMP and MPI on my code like this:
OpenMP phase:
ranks 1-7 wait on a MPI_Barrier
rank 0 uses all 8 processors with OpenMP
MPI phase:
rank 0 reaches barrier and all ranks use one processor each
So far, I've done:
set I_MPI_WAIT_MODE 1 so that ranks 1-7 don't use the CPU while on the barrier.
set omp_set_num_threads(8) on rank 0 so that it launches 8 OpenMP threads.
It all worked. Rank 0 did launch 8 threads, but all are confined to one processor. On the OpenMP phase I get 8 threads from rank 0 running on one processor and all other processors are idle.
How do I tell MPI to allow rank 0 to use the other processors? I am using Intel MPI, but could switch to OpenMPI or MPICH if needed.
The following code shows an example of how to save the CPU affinity mask before the OpenMP part, alter it to allow all CPUs for the duration of the parallel region, and then restore the previous CPU affinity mask. The code is Linux-specific, and it makes no sense unless process pinning is enabled in the MPI library: it is activated by passing --bind-to-core or --bind-to-socket to mpiexec in Open MPI, and deactivated by setting I_MPI_PIN to disable in Intel MPI (the default on 4.x is to pin processes).
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
...
cpu_set_t *oldmask, *mask;
size_t size;
int nrcpus = 256; // 256 cores should be more than enough
int i;
// Save the old affinity mask
oldmask = CPU_ALLOC(nrcpus);
size = CPU_ALLOC_SIZE(nrcpus);
CPU_ZERO_S(size, oldmask);
if (sched_getaffinity(0, size, oldmask) == -1) { perror("sched_getaffinity"); }
// Temporarily allow running on all processors
mask = CPU_ALLOC(nrcpus);
CPU_ZERO_S(size, mask); // start from an empty set before setting bits
for (i = 0; i < nrcpus; i++)
    CPU_SET_S(i, size, mask);
if (sched_setaffinity(0, size, mask) == -1) { perror("sched_setaffinity"); }
#pragma omp parallel
{
    // OpenMP work goes here
}
CPU_FREE(mask);
// Restore the saved affinity mask
if (sched_setaffinity(0, size, oldmask) == -1) { perror("sched_setaffinity"); }
CPU_FREE(oldmask);
...
You can also tweak the pinning arguments of the OpenMP run-time. For GCC/libgomp the affinity is controlled by the GOMP_CPU_AFFINITY environment variable, while for Intel compilers it is KMP_AFFINITY. You can still use the code above if the OpenMP run-time intersects the supplied affinity mask with that of the process.
Just for the sake of completeness - saving, setting and restoring the affinity mask on Windows:
#include <windows.h>
...
HANDLE hCurrentProc, hDupCurrentProc;
DWORD_PTR dwpSysAffinityMask, dwpProcAffinityMask;
// Obtain a usable handle of the current process
hCurrentProc = GetCurrentProcess();
DuplicateHandle(hCurrentProc, hCurrentProc, hCurrentProc,
&hDupCurrentProc, 0, FALSE, DUPLICATE_SAME_ACCESS);
// Get the old affinity mask
GetProcessAffinityMask(hDupCurrentProc,
&dwpProcAffinityMask, &dwpSysAffinityMask);
// Temporarily allow running on all CPUs in the system affinity mask
// (SetProcessAffinityMask takes the mask by value, not by pointer)
SetProcessAffinityMask(hDupCurrentProc, dwpSysAffinityMask);
#pragma omp parallel
{
}
// Restore the old affinity mask
SetProcessAffinityMask(hDupCurrentProc, dwpProcAffinityMask);
CloseHandle(hDupCurrentProc);
...
This should work with a single processor group (up to 64 logical processors).
Thanks all for the comments and answers. You are all right. It's all about the "PIN" option.
To solve my problem, I just had to:
I_MPI_WAIT_MODE=1
I_MPI_PIN_DOMAIN=omp
Simple as that. Now all processors are available to all ranks.
The option
I_MPI_DEBUG=4
shows which processors each rank gets.

Why number 9 in kill -9 command in unix? [closed]

I understand this is off topic, but I couldn't find an answer anywhere online, and I thought the programming gurus in the community might know.
I usually use
kill -9 pid
to kill a job. I have always wondered about the origin of the 9. I looked it up online, and it says
"9 Means KILL signal that is not catchable or ignorable. In other words it would signal process (some running application) to quit immediately" (source: http://wiki.answers.com/Q/What_does_kill_-9_do_in_unix_in_its_entirety)
But why 9? And what about the other numbers? Is there any historical significance, or is it because of the architecture of Unix?
See the Wikipedia article on Unix signals for the list of other signals. SIGKILL just happened to get the number 9.
You can use the mnemonics as well as the numbers:
kill -SIGKILL pid
There were 8 other signals they came up with first.
I think a better answer here is simply this:
mike#sleepycat:~☺ kill -l
1) SIGHUP 2) SIGINT 3) SIGQUIT 4) SIGILL 5) SIGTRAP
6) SIGABRT 7) SIGBUS 8) SIGFPE 9) SIGKILL 10) SIGUSR1
11) SIGSEGV 12) SIGUSR2 13) SIGPIPE 14) SIGALRM 15) SIGTERM
16) SIGSTKFLT 17) SIGCHLD 18) SIGCONT 19) SIGSTOP 20) SIGTSTP
21) SIGTTIN 22) SIGTTOU 23) SIGURG 24) SIGXCPU 25) SIGXFSZ
26) SIGVTALRM 27) SIGPROF 28) SIGWINCH 29) SIGIO 30) SIGPWR
31) SIGSYS 34) SIGRTMIN 35) SIGRTMIN+1 36) SIGRTMIN+2 37) SIGRTMIN+3
38) SIGRTMIN+4 39) SIGRTMIN+5 40) SIGRTMIN+6 41) SIGRTMIN+7 42) SIGRTMIN+8
43) SIGRTMIN+9 44) SIGRTMIN+10 45) SIGRTMIN+11 46) SIGRTMIN+12 47) SIGRTMIN+13
48) SIGRTMIN+14 49) SIGRTMIN+15 50) SIGRTMAX-14 51) SIGRTMAX-13 52) SIGRTMAX-12
53) SIGRTMAX-11 54) SIGRTMAX-10 55) SIGRTMAX-9 56) SIGRTMAX-8 57) SIGRTMAX-7
58) SIGRTMAX-6 59) SIGRTMAX-5 60) SIGRTMAX-4 61) SIGRTMAX-3 62) SIGRTMAX-2
63) SIGRTMAX-1 64) SIGRTMAX
As for the "significance" of 9... I would say there is probably none. According to The Linux Programming Interface (p. 388):
Each signal is defined as a unique (small) integer, starting
sequentially from 1. These integers are defined in <signal.h> with
symbolic names of the form SIGxxxx . Since the actual numbers used for
each signal vary across implementations, it is these symbolic names
that are always used in programs.
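The mapping between numbers and symbolic names can also be inspected from a program. A minimal sketch using Python's standard signal module (the helper name signal_name is made up for illustration); the numbers it prints are those of the platform it runs on:

```python
import signal

def signal_name(number):
    """Return the symbolic name for a signal number on this platform."""
    return signal.Signals(number).name

if __name__ == "__main__":
    # Print a kill -l style listing of the signals Python knows about.
    for sig in sorted(signal.Signals, key=int):
        print(f"{int(sig):2d}) {sig.name}")
```

Using the symbolic name (signal.SIGKILL) rather than the literal 9 follows the advice in the quote, since most other signal numbers differ between implementations.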
First you need to know what signals are in Unix-like systems (it'll take just a few minutes).
Signals are software interrupts sent to a (running) program to indicate that an important event has occurred.
The events can vary from user requests to illegal memory access
errors. Some signals, such as the interrupt signal, indicate that a
user has asked the program to do something that is not in the usual
flow of control.
There are several types of signals we can use. To get a full list of all the available signals, use the "$ kill -l" command:
In the above output it's clearly visible that each signal has a 'signal number' (e.g. 1, 2, 3) and a 'signal name' (e.g. SIGHUP, SIGINT, SIGQUIT) associated with it. For a detailed look at what each signal does, visit this link.
Finally, coming to the question "Why number 9 in kill -9 command":
There are several methods of delivering signals to a program or script. One commonly used method is the kill command. The basic syntax is:
$ kill -signal pid
Here signal is either the number or the name of the signal, followed by the process ID (pid) to which the signal will be sent.
For example, the -SIGKILL (or -9) signal kills the process immediately.
$ kill -SIGKILL 1001
and
$ kill -9 1001
Both commands do the same thing: the first uses the 'signal name' and the second uses the 'signal number'.
Verdict: you are free to use either the 'signal name' or the 'signal number' with the kill command.
It's a reference to "Revolution 9" by the Beatles. A collection of strung-together sound clips and found noises, this recording features John Lennon repeating over and over "Number 9, Number 9..." The song drew further attention in 1969 when it was discovered that, when played backwards, John seemed to be saying "Turn me on, dead man..."
Therefore the ninth signal was destined to be the deadliest of the kill signals.
There’s a very long list of Unix signals, which you can view on Wikipedia. Somewhat confusingly, you can actually use kill to send any signal to a process. For instance, kill -SIGSTOP 12345 forces process 12345 to pause its execution, while kill -SIGCONT 12345 tells it to resume. A slightly less cryptic version of kill -9 is kill -SIGKILL.
I don't think there is any significance to the number 9. In addition, despite common belief, kill is used not only to kill processes but also to send any signal to a process.
If you are really curious you can read here and here.
Why kill -9:
The number 9 in the list of signals was chosen for SIGKILL in reference to "killing the 9 lives of a cat".
SIGKILL is used to kill a process. It cannot be ignored or handled. On Linux, the ways to send SIGKILL are:
kill -9 <process_pid>
kill -SIGKILL <process_pid>
killall -SIGKILL <process_name>
killall -9 <process_name>
Type kill -l in your shell
and you will find SIGKILL at number 9 [ 9) SIGKILL ], so you can use
either kill -9 or kill -SIGKILL.
SIGKILL is the sure-kill signal: it cannot be caught, blocked, or ignored.
It always performs its default action, which is to kill the process.
The -9 is the signal_number, and specifies that the kill message sent should be of the KILL (non-catchable, non-ignorable) type.
kill -9 pid
This is the same as:
kill -SIGKILL pid
Without specifying a signal_number the default is -15, which is TERM (software termination signal). Typing kill <pid> is the same as kill -15 <pid>.
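The difference between the default TERM and KILL can be observed from a parent process. This is a rough sketch, assuming a POSIX system; the helper name exit_signal_after is hypothetical, and the child is just a Python interpreter sleeping for a minute.

```python
import signal
import subprocess
import sys

def exit_signal_after(sig):
    """Start a sleeping child, send it `sig`, and return the signal that ended it."""
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    child.send_signal(sig)
    # On POSIX, a negative return code -N means the child was terminated by signal N.
    return -child.wait(timeout=10)

if __name__ == "__main__":
    print("default kill (SIGTERM) ended the child with signal", exit_signal_after(signal.SIGTERM))
    print("kill -9 (SIGKILL) ended the child with signal", exit_signal_after(signal.SIGKILL))
```

Here the child does not install a handler, so both signals end it. A program that catches or ignores SIGTERM would survive the first call, whereas SIGKILL leaves it no such option.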
Both kill -sigkill processID and kill -9 processID do the same thing.
It's basically for forced termination of the process.
Some processes cannot be killed with a plain kill such as "kill %1". To terminate such a process, use kill -9.
For example:
Open vim and stop it with Ctrl+Z, then run jobs. A plain kill on the job does not terminate it, so use kill -9 instead.
