Our MariaDB server suddenly started to use all available CPU on our encrypted database.
It doesn't seem to have an impact on performance though.
It seems to be a thread locking issue, and setting innodb-encryption-threads from 4 to 1 fixes it.
The setting of 4 corresponds to the 4 threads using all the CPU (50% each on a dual core).
An strace on one of the offending threads floods with this:
futex(0x561733657bc4, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x561733657b60, FUTEX_WAIT_PRIVATE, 2, NULL) = 0
futex(0x561733657b60, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fb4ede24af0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1520416318, tv_nsec=32097000}, 0xffffffff) = 0
futex(0x7fb4ede24a90, FUTEX_WAIT_PRIVATE, 2, NULL) = 0
futex(0x7fb4ede24a90, FUTEX_WAKE_PRIVATE, 1) = 0
sched_yield()
What is causing this and how can we fix it?
When you are saturating a server resource (CPU/IO/etc.) and you have control over how many threads/processes/requests are demanding that resource, it is often better to throttle the number of threads back.
In your example, you have 4 threads hitting 2 cores. In doing so, each thread takes twice (4/2) as long as it otherwise could to finish its task (encrypting something).
That is, 4 things (InnoDB blocks?) are being locked for twice as long, versus 2 things being locked for less time each.
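For reference, if these workers come from MariaDB's data-at-rest encryption (as the variable name suggests), the thread count can be lowered at runtime or persisted in the server config:
SET GLOBAL innodb_encryption_threads = 1;
# or in the [mysqld] section of my.cnf:
innodb-encryption-threads = 1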
I have a function written for 1 GPU; it runs for about 10 seconds with one set of args, and I have a very long list of args to go through. I would like to use both of my AMD GPUs, so I have some wrapper code that launches 2 threads and runs my function on thread 0 with the argument gpu_idx 0 and on thread 1 with the argument gpu_idx 1.
I have a CUDA version for another machine, and there I just run checkCudaErrors(cudaSetDevice((unsigned int)device_id)); to get the desired behavior.
With OpenCL I have tried the following:
/* ret, platform_id, ret_num_platforms, ret_num_devices, device_id and
   context are file-scope variables in the *.c file (see below). */
void createDevice(int device_idx)
{
    cl_device_id *devices;

    ret = clGetPlatformIDs(1, &platform_id, &ret_num_platforms);
    HANDLE_CLERROR_G(ret);

    /* First query the number of devices, then fetch their IDs. */
    ret = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_ALL, 0, NULL, &ret_num_devices);
    HANDLE_CLERROR_G(ret);

    devices = (cl_device_id*)malloc(ret_num_devices * sizeof(cl_device_id));
    ret = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_ALL, ret_num_devices, devices, &ret_num_devices);
    HANDLE_CLERROR_G(ret);

    if (device_idx >= ret_num_devices)
    {
        fprintf(stderr, "Found %i devices but asked for device at index %i\n", ret_num_devices, device_idx);
        exit(1);
    }

    device_id = devices[device_idx];
    free(devices);

    // usleep(((unsigned int)(500000*(1-device_idx)))); // without this line multithreaded 2 gpu execution does not work.

    context = clCreateContext(NULL, 1, &device_id, NULL, NULL, &ret);
    HANDLE_CLERROR_G(ret);
}
context is a static variable in my .c file that I then use again later when I create the kernel.
This code works when I run with only device_idx 0, or only device_idx 1, and even if I manually run the executable "simultaneously" in two terminal windows with device_idx 0 and device_idx 1.
BUT, there is something about the threads being "too" concurrent that prevents this code from working. In fact, depending on the amount of sleep (commented above), I get different behavior (sometimes both threads do work on GPU 0, sometimes both do work on GPU 1, sometimes the threads are balanced across both GPUs). If I sleep for too little time I get CL_INVALID_CONTEXT, and if I don't sleep at all I get CL_INVALID_KERNEL_NAME.
Like I said, I don't get any errors when running on GPU 0 or GPU 1 alone, only when spawning multiple threads that call this code (as an .so with an extern "C" function from Go) simultaneously with device_idx 0 in thread 0 and device_idx 1 in thread 1.
How can I solve my problem? I am attached to the idea of having an executable that works on 1 GPU, for which I specify which GPU, and that specification should be respected.
What is the proper way to pick the device when both devices need to be used, one completely separate from the other?
Whoops! Instead of saving device_id into a static variable, I now return it from the above code and use it as a local variable, and everything works as expected and is now thread safe.
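A minimal sketch of that change, with the device returned to the caller instead of stored in a shared static (pickDevice is a made-up name; the HANDLE_CLERROR_G macro and the bounds check from the original are assumed):

static cl_device_id pickDevice(int device_idx)
{
    cl_platform_id platform_id;
    cl_uint num_platforms, num_devices;
    cl_device_id *devices, chosen;
    cl_int ret;

    ret = clGetPlatformIDs(1, &platform_id, &num_platforms);
    HANDLE_CLERROR_G(ret);
    ret = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_ALL, 0, NULL, &num_devices);
    HANDLE_CLERROR_G(ret);

    devices = (cl_device_id*)malloc(num_devices * sizeof(cl_device_id));
    ret = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_ALL, num_devices, devices, &num_devices);
    HANDLE_CLERROR_G(ret);

    chosen = devices[device_idx];   /* bounds check against num_devices omitted for brevity */
    free(devices);
    return chosen;
}

The caller then creates the context from its own local copy:

cl_device_id dev = pickDevice(device_idx);
context = clCreateContext(NULL, 1, &dev, NULL, NULL, &ret);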
I have a 5 stage Datapath with the following steps' times:
Fetch 190ps
Decode 120ps
Alu 170ps
Memory 200ps
Writeback 120ps
The question asks how many instructions can be executed in 1 us, knowing that the processor is multi-cycle without a pipeline and that the clock is optimised.
I know that if the processor were pipelined and the pipeline were initially empty, the number of instructions would be 4996:
200ps (longest stage's time) -> 1 instruction
1 us -> x
x = 5000
Nº of instructions = 5000 - 4 = 4996
Since there's no pipeline in this case, what I did was:
190ps+120ps+170ps+200ps+120ps = 800ps
800ps -> 1 instruction
1 us -> x
x = 1250 instructions
However the correct answer is 1000 instructions.
Can someone explain why?
Thank you
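The usual reading of "multi-cycle with an optimised clock" (an assumption on my part about what the exercise intends) is that every stage takes exactly one clock cycle, so the clock period must fit the slowest stage rather than the stage times being added directly:
200ps (longest stage) -> 1 cycle
5 stages -> 5 x 200ps = 1000ps per instruction
1 us / 1000ps -> 1000 instructions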
I am running an MPI C++ program locally with two processes: mpirun -np 2 <programname>. I am seeing inconsistent behavior of the MPI_Gather command. To test, I wrote a very short code snippet. I copied the code to the start of main and it worked fine. But when I copied it to other points in the code, it sometimes gives the correct result and sometimes not. The code snippet is copied below. I doubt the issue is with the snippet itself (since it sometimes works properly). Typically, when I see inconsistent behavior like this, I suspect memory corruption. However, I have run Valgrind in this case and it did not report anything amiss (although maybe I am not running Valgrind correctly for MPI - I am not experienced using Valgrind on MPI programs). What could be causing this type of inconsistent behavior, and what can I do to detect the problem?
Here is the code snippet.
double val[2] = {0, 1};
val[0] += 10.0*double(gmpirank);   // gmpirank holds this process's MPI rank
val[1] += 10.0*double(gmpirank);
double recv[4];
printdebug("send", val[0], val[1]);
// Gather the two doubles from each process into recv on rank 0
int err = MPI_Gather(val, 2, MPI_DOUBLE, recv, 2, MPI_DOUBLE, 0, MPI_COMM_WORLD);
if (gmpirank == 0) {
    printdebug("recv");
    printdebug(recv[0], recv[1]);
    printdebug(recv[2], recv[3]);
}
printdebug("finished test", err);
The print debug function prints to a file, which is separate for each process, and separates the input arguments with a comma.
Process 1 prints:
send, 10, 11
finished test, 0
Sometimes, Process 0 prints:
send, 0, 1
recv
0, 1
10, 11
finished test, 0
But when I place the snippet in other sections of the code, Process 0 sometimes prints something like this:
send, 0, 1
recv
0, 1
2.9643938750474793e-322, 0
finished test, 0
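Incidentally, the gather call itself can be checked in isolation by compiling a self-contained version of the snippet and running it with mpirun -np 2; here printdebug is replaced by printf and gmpirank by the rank from MPI_Comm_rank (both stand-ins on my part):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double val[2] = {0, 1};
    val[0] += 10.0 * rank;
    val[1] += 10.0 * rank;

    double recv[4];
    // Each rank contributes 2 doubles; rank 0 receives 2 per rank.
    int err = MPI_Gather(val, 2, MPI_DOUBLE, recv, 2, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("recv: %g, %g, %g, %g (err=%d)\n",
               recv[0], recv[1], recv[2], recv[3], err);

    MPI_Finalize();
    return 0;
}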
I found the solution. As suspected, the problem was a memory corruption.
I made a beginner mistake when running Valgrind with MPI. I ran:
valgrind <options> mpirun -np 2 <programname>
instead of
mpirun -np 2 valgrind <options> <programname>
Thus, I was running valgrind on "mpirun" itself, not on the intended program. When I ran Valgrind correctly, it identified the memory corruption in an unrelated part of the code.
Kudos to another Stack Overflow Q/A for helping me figure this out: Using valgrind to spot error in mpi code
I am working on a small embedded Linux network bridge device (with a 2.6 tickless kernel).
I am trying to reduce CPU wakeups when the system is idle and check which timers can be avoided.
I have noticed that although STP is explicitly disabled on the bridge, the hello timer and hold timer are expiring every 2 seconds (the default time):
echo 1 > /proc/timer_stats; sleep 30; echo 0 > /proc/timer_stats ; cat /proc/timer_stats
Timer Stats Version: v0.2
Sample period: 30.016 s
1, 1 init hrtimer_start_range_ns (hrtimer_wakeup)
15, 0 swapper br_transmit_config (br_hold_timer_expired)
15, 0 swapper run_timer_softirq (br_hello_timer_expired)
15, 5 events/0 worker_thread (delayed_work_timer_fn)
2D, 5 events/0 neigh_periodic_work (delayed_work_timer_fn)
1, 502 sleep hrtimer_start_range_ns (hrtimer_wakeup)
5, 0 swapper run_timer_softirq (sync_supers_timer_fn)
5, 10 bdi-default bdi_forker_task (process_timeout)
2D, 5 events/0 neigh_periodic_work (delayed_work_timer_fn)
1, 5 events/0 worker_thread (delayed_work_timer_fn)
1, 548 sleep hrtimer_start_range_ns (hrtimer_wakeup)
63 total events, 2.098 events/sec
Looking at the kernel bridge code, it seems that these timers are set up regardless of whether STP is enabled or disabled.
One approach could be to make the timer expiry longer (brctl sethello br0 30); that is OK, though not ideal.
A different approach would be to patch the kernel so the timer initialization does not take place when STP is disabled, but patching the kernel is also not ideal.
Is there a reason these timers are initialized even though STP is disabled?
Does anyone have a different idea/approach ?
Thanks
I have a problem with a hard fault that appears at seemingly random times, where a pointer ends up pointing to address A5 or FF (far outside my allowed memory space, which is at 80000000 and up). It seems to always be the same pointer with these two values.
I'm using an embedded system running an STM32F205RE processor which communicates with an FM/Bluetooth/GPS chip called CG2900, and this is where the error occurs.
Using a debugger I can see the pointer pointing to address A5 and FF respectively during a few test runs. However, it seems to happen at random times: sometimes I can run the test for an hour without a failure, while other times it crashes 20 seconds in.
I'm running FreeRTOS as a scheduler to switch between different tasks (one for radio, one for Bluetooth, one for other periodic maintenance), which might interfere somehow.
What can be the cause of this? As it's running on custom hardware, a hardware issue cannot be ruled out. Any pointers (no pun intended) on how to approach debugging the issue?
EDIT:
After further investigation it seems that the crash location is quite random, not just that specific pointer. I used a hard fault handler to get the following register values (all values in hex):
Semi-long run before crash (minutes):
R0 = 1
R1 = fffffffd
R2 = 20000400
R3 = 20007f7c
R12 = 7
LR [R14] = 200000c8 subroutine call return address
PC [R15] = 1010101 program counter
PSR = 8013d0f
BFAR = e000ed38
CFSR = 10000
HFSR = 40000000
DFSR = 0
AFSR = 0
SCB_SHCSR = 0
Very short run before crash (seconds):
R0 = 40026088
R1 = fffffff1
R2 = cb3
R3 = 1
R12 = 34d
LR [R14] = 40026088 subroutine call return address
PC [R15] = a5a5a5a5 program counter
PSR = fffffffd
BFAR = e000ed38
CFSR = 100
HFSR = 40000000
DFSR = 0
AFSR = 0
SCB_SHCSR = 0
Another short one (seconds):
R0 = 0
R1 = fffffffd
R2 = 20000400
R3 = 20007f7c
R12 = 7
LR [R14] = 200000c8 subroutine call return address
PC [R15] = 1010101 program counter
PSR = 8013d0f
BFAR = e000ed38
CFSR = 1
HFSR = 40000000
DFSR = 0
AFSR = 0
SCB_SHCSR = 0
After a very long run (1hour +):
R0 = e80000d0
R1 = fffffffd
R2 = 20000400
R3 = 2000877c
R12 = 7
LR [R14] = 200000c8 subroutine call return address
PC [R15] = 1010101 program counter
PSR = 8013d0f
BFAR = 200400d4
CFSR = 8200
HFSR = 40000000
DFSR = 0
AFSR = 0
SCB_SHCSR = 0
It seems to crash at the same point most of the time. I adjusted the memory according to the previous suggestions, but I still seem to have the same issue.
Thanks for your time!
Kind regards
In your comment you mention that this pointer is explicitly assigned once and then never written to. In that case you should at least declare it const and use initialisation rather than assignment, e.g.
arraytype* const ptr = array;
That will allow the compiler to detect any explicit writes. However, it is more likely that the pointer is being corrupted by some unrelated coding error.
The Cortex-M3 on-chip debug supports data access breakpoints; you should set such a breakpoint over the pointer in question so that all write accesses to it are trapped. You will get a break on initialisation, then after that on every modification - intentional or otherwise.
Likely causes are overrun of an adjacent array or of a thread stack.
If you have tried relocating the array and the problem persists, then some task's stack is overflowing.
As you mentioned, you are using FreeRTOS, and because the behaviour is random, it is likely that something is wrong with the stack size you pass in your calls to xTaskCreate.
This usually happens when the allocated size is less than you really need.
If you read the documentation for usStackDepth, you will notice that it is the number of words the stack can hold, not the number of bytes.
I would personally rule out hardware problems on your embedded board and focus on the FreeRTOS configuration.
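A brief sketch to illustrate (the task, its name, stack depth and priority are made-up values; uxTaskGetStackHighWaterMark must be enabled in FreeRTOSConfig.h). The depth argument is in words, so on a 32-bit port 256 reserves 1024 bytes, not 256. Incidentally, FreeRTOS fills unused task stack with 0xa5 when stack checking is enabled, which may be related to the 0xa5a5a5a5 values you keep seeing:

#include "FreeRTOS.h"
#include "task.h"

static void radioTask(void *params)      /* hypothetical task */
{
    for (;;) {
        /* ... do work ... */

        /* Worst-case remaining stack seen so far, in words; a value
           near zero means this task's stack is about to overflow. */
        UBaseType_t headroom = uxTaskGetStackHighWaterMark(NULL);
        (void)headroom;

        vTaskDelay(100);                  /* delay is in ticks */
    }
}

void startTasks(void)
{
    /* Third argument is the stack depth in WORDS: 256 words = 1 KB
       on a 32-bit Cortex-M, not 256 bytes. */
    xTaskCreate(radioTask, "radio", 256, NULL, tskIDLE_PRIORITY + 1, NULL);
    vTaskStartScheduler();
}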
Turns out the problem was caused by the memory. As I was running the processor at its highest speed (120 MHz) with a 1.8 volt supply (it's designed mainly for 3 volts), I had race conditions with the memory. Resolved it by using a higher wait state.
I haven't read the full question, but it sounds like the 0xa5a5a5a5 value is a symptom of bit decompression, similar to bit rot. Basically, the laminated structure of 0xffffffff = 0b11111111111111111111111111111111 starts to peel apart and the 1's drift away from each other. They can even start to intrude into neighbouring words.