klee with loops strange behaviour with similar code - llvm-gcc

I have a question about how KLEE (a symbolic execution tool) handles loops with symbolic parameters:
int loop(int data) {
int i, result=0;
for (i=0;i<data ;i++){
result+= 1;
//printf("result%d\n", result); //-- With this line klee give different values for data
}
return result;
}
void main() {
int n;
klee_make_symbolic(&n, sizeof(int),"n");
int result=loop(n) ;
}
If we run KLEE on this code, it produces only one test case.
However, if we uncomment the printf(...), KLEE needs some kind of limit to stop the execution, because it keeps producing values for n:
--max-depth=200
I would like to understand why KLEE behaves differently here; it doesn't make sense to me. Why doesn't it produce the same values when there is no printf in the code?
I discovered that this happens when the --optimize option is used;
without it, the behavior is the same in both cases. Does anyone know how KLEE's --optimize works?
Another question on the same topic: as I understand the paper they published, their search heuristics should keep the exploration from being infinite (they are meant to avoid starvation).
But when I run it, it doesn't stop. Is it true that KLEE should finish on a loop like this?
Thanks in advance

What I want to know is why the behavior differs with the --optimize option. Thanks
In C/C++ there is the concept of "undefined behaviour" (see: A Guide to Undefined Behavior in C and C++, Part1,Part 2,Part 3, What Every C Programmer Should Know About Undefined Behavior [#1/3],[#2/3],[#3/3]).
Signed integer overflow is undefined behaviour precisely so that the compiler can perform optimizations like this:
bool f(int x){ return x+1>x ? true : false ; }
Let's think: x+1>x is always true in ordinary algebra, and in modular arithmetic it is almost always true (the one exception is overflow), so the compiler may turn it into:
true
Such practices enable a huge number of great optimizations. (By the way, if you want defined behaviour on overflow, use unsigned integers; this is used extensively in implementations of cryptographic algorithms.)
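A minimal sketch in C of the two well-defined alternatives hinted at above: testing for overflow before adding, and relying on unsigned wraparound. The function names are made up for illustration.

```c
#include <limits.h>
#include <stdbool.h>

/* Well-defined check: compare against INT_MAX *before* adding,
 * so the signed addition that would overflow never happens. */
bool increment_would_overflow(int x)
{
    return x == INT_MAX;
}

/* Unsigned arithmetic wraps modulo 2^N by definition, so the
 * compiler may not assume x + 1 > x and fold this to false. */
bool unsigned_increment_wraps(unsigned x)
{
    return x + 1u < x;
}
```

Unlike x+1>x on a signed int, neither comparison depends on an overflow having already occurred in signed arithmetic.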
On the other hand, it sometimes leads to surprising results, like this code:
int main(){
int s=1, i=0;
while (s>0) {
++i;
s=2*s;
}
return i;
}
being optimised into an infinite loop. It's not a bug! It's a powerful feature! (Again, for defined behaviour, use unsigned.)
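For comparison, here is a sketch of the same doubling loop with an unsigned counter, where wraparound is defined. It assumes a 32-bit type (uint32_t); the function name is hypothetical.

```c
#include <stdint.h>

/* Doubling loop with defined wraparound: s goes 1, 2, 4, ..., 2^31
 * and then wraps to 0, which ends the loop. */
int count_doublings(void)
{
    uint32_t s = 1;
    int i = 0;
    while (s > 0) {
        ++i;
        s = (uint32_t)(s * 2u);   /* reduced modulo 2^32 */
    }
    return i;                      /* 32 for a 32-bit s */
}
```

Because the wrap to zero is guaranteed here, the compiler has to preserve the loop's termination at every optimisation level.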
Let's generate assembly for the signed loop above at two optimisation levels:
$ g++ -O1 -S -o overflow_loop-O1.s overflow_loop.cpp
$ g++ -O2 -S -o overflow_loop-O2.s overflow_loop.cpp
The loop check is compiled differently:
overflow_loop-O1.s:
(...)
.L2:
addl $1, %eax
cmpl $31, %eax
jne .L2
(...)
overflow_loop-O2.s:
(...)
.L2:
jmp .L2
(...)
I would advise you to check the assembly of your code at different optimisation levels (gcc -S -O0 vs. gcc -S -O1 ... -O3).
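Applied to the loop from the question, one plausible explanation (an assumption on my part, not verified against what KLEE's --optimize actually runs) is that a loop without side effects can be replaced by a closed form, which leaves far fewer branches on the symbolic value. Since main never uses the result, the call may even be removed entirely. With the printf in the body, every iteration has an observable effect, so the loop has to stay and each i<data check can fork a new path. The function names below are illustrative.

```c
/* The loop from the question, without the printf. */
int loop_original(int data)
{
    int i, result = 0;
    for (i = 0; i < data; i++)
        result += 1;
    return result;
}

/* A form an optimizer is allowed to reduce it to: no loop,
 * and at most one branch on the (symbolic) input. */
int loop_closed_form(int data)
{
    return data > 0 ? data : 0;
}
```

Both functions return the same value for every input, which is what permits the transformation.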
And again, nice posts about the topic: [1], [2], [3], [4], [5], [6].

Related

Using OpenMP with GPU

Good day, everyone!
I would like to ask the community for advice on using GPU computing power instead of, or together with, the CPU.
I have a working program based on a recursive search over all combinations of certain events, parallelized with OpenMP to run on all available processor cores.
The C++ pseudocode is as follows:
// #includes
// function declarations
// declaring a global variable:
QVector<QVector<QVector<float>>> variant; // (or "std::vector")
int main() {
// reads data from file
// data are converted and analyzed
// the variant variable containing the current best result is filled in (here - by pre-analysis)
#pragma omp parallel shared(variant)
#pragma omp master
// call the recursive search over all variants:
PEREBOR(Tabl_1, a, i_a, ..., reс_depth);
return 0;
}
void PEREBOR(QVector<QVector<uint8_t>> Tabl_1, QVector<A_struct> a, uint8_t i_a, ..., uint8_t reс_depth)
{
// looking for the boundaries of the first cycle for some reasons
for (int i = quantity; i < another_quantity; i++) {
// the Tabl_1 is processed and modified to determine the number of steps in the subsequent for cycle
for (int k = 0; k < the_quantity_just_found; k++) {
if the recursion depth is not 1, we go down further: {
// add descent to the next recursion level to the call stack:
#pragma omp task
PEREBOR(Tabl_1_COPY, a, i_a, ..., reс_depth-1);
}
else (if we went down to the lowest level): {
if (condition fulfilled) // condition check - READ variant variable
variant = it_is_equal_to_that_,_to_that...;
else
continue;
}
}
}
}
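A minimal, runnable C sketch of the task pattern in the pseudocode above, for the CPU only. All names (State, score, search, run_search, DEPTH, BRANCH) are hypothetical stand-ins, and the scoring rule is invented. It differs from the pseudocode in two ways that are assumptions about the intent: it uses "single" rather than "master" to start the root call, and it guards the update of the shared best result with a critical section, since several tasks may reach the lowest level at the same time.

```c
#define DEPTH  4   /* recursion depth (hypothetical) */
#define BRANCH 3   /* choices per level (hypothetical) */

typedef struct { int pick[DEPTH]; } State;  /* copied by value into each task */

static int   best_score = -1;   /* plays the role of the shared `variant` */
static State best_state;

/* Invented scoring rule, only so that the example has a known optimum. */
static int score(const State *s)
{
    int total = 0;
    for (int d = 0; d < DEPTH; d++)
        total += (d + 1) * s->pick[d];
    return total;
}

static void search(State s, int depth)
{
    if (depth == DEPTH) {
        int sc = score(&s);
        /* Read-compare-write on shared data must not interleave. */
        #pragma omp critical(best_update)
        {
            if (sc > best_score) {
                best_score = sc;
                best_state = s;
            }
        }
        return;
    }
    for (int k = 0; k < BRANCH; k++) {
        s.pick[depth] = k;
        /* firstprivate: each task gets its own snapshot of s. */
        #pragma omp task firstprivate(s, depth)
        search(s, depth + 1);
    }
}

int run_search(void)
{
    State root = {{0}};
    best_score = -1;
    #pragma omp parallel
    #pragma omp single
    search(root, 0);
    /* The barrier ending the parallel region waits for all tasks. */
    return best_score;
}
```

Compiled without OpenMP support the pragmas are ignored and the search runs serially with the same result, which makes it easy to check the parallel version against a reference.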
Unfortunately, I don't have a CPU with a thousand cores at my disposal, and without one the algorithm runs for a very long time. At work, I was advised to consider using a GPU to speed up the calculations. I learned that OpenMP can work with video cards (especially NVIDIA ones), but that OpenACC also does this well.
So my main question is: is it possible to run a recursive algorithm on a GPU simply and, at the same time, efficiently? Can this give a noticeable speedup over the CPU? If so, would OpenACC do better? Is it possible to give instructions to the video card through "#pragma omp task", or are other directives REQUIRED? And how could calculations on the CPU and GPU be combined?
Thank you so much for any help!
P.S. I apologize for my English, which is not my native language :)

UsageFault when branching to a function pointer on Cortex-M0

I'm running code on an STM32F0 (ARM Cortex-M0). I define a function pointer to a nested function:
void My_Async_Func(void *handle, void (*complete)(bool success)) {
/*
* \/ A nested function \/
*/
void receiveHandler(void) {
// This function lies at an even-numbered address
}
/* ... */
uart->rxDoneHandler = &receiveHandler;
Go(uart);
}
The nested function appears to be screwing things up. Later when I call that rxDoneHandler pointer, it tries to branch to 0x8c0c6c0c and I get a UsageFault. According to the ARM docs, this is because I'm branching to an even-numbered address on a processor that only supports the Thumb instruction set, which requires you to only branch to odd-numbered addresses.
I'm using GCC arm-none-eabi 4.9.3. My CFLAGS are:
arm-none-eabi-gcc -mcpu=cortex-m0 -mthumb -mfloat-abi=soft -O0 -g3 -Wall -fmessage-length=0 -ffunction-sections -c -MMD -MP
Things I tried
I tried calling gcc with -mthumb-interwork as suggested here. Same result.
I tried manually setting the alignment of the function with an attribute like
__attribute__((aligned(16))) void receiveHandler(void) {
}
Same result.
I tried manually adding 1 to the pointer when I call it like
(uart->rxDoneHandler + 1)();
Same result.
Am I doing something wrong, or is this a bug in the compiler?
A nested function must not be called from the outside of the enclosing function unless you're still "logically" inside that function; from the GCC page on nested functions (emphasis mine):
It is possible to call the nested function from outside the scope of its name by storing its address or passing the address to another function:
hack (int *array, int size)
{
void store (int index, int value)
{ array[index] = value; }
intermediate (store, size);
}
Here, the function intermediate receives the address of store as an argument. If intermediate calls store, the arguments given to store are used to store into array. But this technique works only so long as the containing function (hack, in this example) does not exit.
My guess is that you register uart->rxDoneHandler as an Interrupt Service Routine (or something that is called from within an ISR). You can't do that, because My_Async_Func has already returned by the time the handler is called.
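A sketch of one way around the restriction, in standard C: make the handler an ordinary file-scope function and keep whatever it needs in static storage, so nothing depends on the enclosing function still being active. The Uart struct here is a hypothetical stand-in for the question's handle type, and on_complete and demo_handler_fires_after_return exist only to exercise the example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the question's UART handle type. */
typedef struct {
    void (*rxDoneHandler)(void);
} Uart;

/* State the handler needs lives in static storage, so it is still
 * valid after My_Async_Func has returned. */
static void (*s_complete)(bool success);

/* Ordinary file-scope function: safe to call at any later time. */
static void receiveHandler(void)
{
    if (s_complete != NULL)
        s_complete(true);
}

void My_Async_Func(Uart *uart, void (*complete)(bool success))
{
    s_complete = complete;
    uart->rxDoneHandler = receiveHandler;
    /* Go(uart) would start the transfer in the original code. */
}

/* Self-check helpers (illustrative only). */
static bool s_seen;
static void on_complete(bool success) { s_seen = success; }

bool demo_handler_fires_after_return(void)
{
    Uart uart = { NULL };
    s_seen = false;
    My_Async_Func(&uart, on_complete);
    uart.rxDoneHandler();   /* invoked after My_Async_Func returned */
    return s_seen;
}
```

The trade-off is that a single static slot supports only one outstanding operation; storing the callback in the handle itself would lift that limit.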

OpenCL function call stack size

Is there a way to find out OpenCL's function call stack size?
I'm using NVIDIA OpenCL 1.2 on Ubuntu (NVIDIA CC = 5.2).
I found an unexpected result in my test code.
After a function has been invoked 64 times, the next invocation seems unable to access its arguments.
I think a call stack overflow causes this problem.
Below is my example code and the result:
void testfunc(int count, int run)
{
if(run==0) return;
count++;
printf("count=%d run=%d\n", count, run);
run = run - 1;
testfunc(count, run);
}
__kernel void hello(__global int * in_img, __global int * out_img)
{
int run;
int count=0;
run = 70;
testfunc(count, run);
}
And this is the result :
count=1 run=70
count=2 run=69
count=3 run=68
count=4 run=67
count=5 run=66
count=6 run=65
count=7 run=64
.....
count=59 run=12
count=60 run=11
count=61 run=10
count=62 run=9
count=63 run=8
count=64 run=7
count=0 run=0 // <--- Why are the count and run values ZERO?
count=0 run=0
count=0 run=0
count=0 run=0
count=0 run=0
count=0 run=0
Recursion is not supported in OpenCL 1.x. From AMD's Introduction to OpenCL:
Key restrictions in the OpenCL C language are:
Function pointers are not supported.
Bit-fields are not supported.
Variable length arrays are not supported.
Recursion is not supported.
No C99 standard headers such as ctypes.h,errno.h,stdlib.h,etc. can be included.
AFAIK, not all implementations have a call-stack-like feature at all. In those cases, and possibly in yours, all function calls are inlined into the calling scope.
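Since recursion is not available, the usual workaround is to rewrite the function as a loop. Below is a sketch of testfunc in iterative form, written as plain C so it can be tried on the host; the loop itself uses nothing that OpenCL C lacks. The name testfunc_iterative and the returned final count are additions for illustration, and the loop condition is run > 0 rather than the original run == 0 check so that a negative argument also terminates.

```c
#include <stdio.h>

/* Iterative equivalent of the recursive testfunc: one loop
 * iteration per former call, so no call stack is required. */
int testfunc_iterative(int count, int run)
{
    while (run > 0) {
        count++;
        printf("count=%d run=%d\n", count, run);
        run = run - 1;
    }
    return count;   /* final count, returned only for checking */
}
```

Called with count=0 and run=70 it prints all 70 lines, from "count=1 run=70" down to "count=70 run=1".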

Type-punning (GCC) - incrementing a pointer parameter on stack

OK, I understand that the GCC 4.x warning "dereferencing type-punned pointer will break strict-aliasing rules" is no joke and I should clean up my code.
I have code which compiles and runs fine with GCC 3.x, and I would be very happy if it did so with GCC 4.x, too. Assume I want the generated code to be as short as possible: the function is passed a pointer and should write some data there. My original code uses the pointer directly on the stack (without a copy) and increments it there (note that I don't want to pass the incremented value back to the caller). You may also think of parameters passed in registers; then any copy would be overhead.
So this was my "ideal" code:
void foo(void *pdataout) {
for (int i=16; i--;)
*(*reinterpret_cast<BYTE**>(&pdataout))++ = 255;
}
I tried some variant (note that the address-operator must be applied to 'pdataout' before any type-cast):
void foo(void *pdataout) {
BYTE *pdo = reinterpret_cast<BYTE*>(*reinterpret_cast<BYTE**>(&pdataout));
for (int i=16; i--;)
*pdo++ = 255;
}
and also this:
void foo(void *pdataout) {
BYTE *pdo = *reinterpret_cast<BYTE**>(&pdataout);
for (int i=16; i--;)
*pdo++ = 255;
}
Nothing pleases GCC 4.x... The following one does, but it uses a copy of the parameter, which I don't like. Is there a way to do this without the copy? I have no idea how to tell the compiler :-(
void foo(void *pdataout) {
BYTE *pdo = reinterpret_cast<BYTE*>(pdataout);
for (int i=16; i--;)
*pdo++ = 255;
}
As far as I understand now, even though GCC no longer warns, using the indirection via an additional variable is not safe!
For me (as a union is not usable), the only real solution is the -fno-strict-aliasing compiler option. Only with it does GCC assume that pointers of different types to the same memory address can refer to the same variable.
This article finally helped me understand strict aliasing.
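For reference, a sketch in C of the variant that converts the pointer value instead of reinterpreting the pointer object. As far as I can tell it needs no special flags: the void* is converted, not accessed through a BYTE** lvalue, and the bytes are written through an unsigned char type, which may alias any object. The BYTE typedef is assumed to be unsigned char, and foo_memset is a hypothetical alternative; in C++ a static_cast<BYTE*> would take the place of the implicit conversion.

```c
#include <string.h>

typedef unsigned char BYTE;   /* assumed definition of BYTE */

/* Convert the pointer *value*; pdataout itself is never accessed
 * through an lvalue of a different pointer type. */
void foo(void *pdataout)
{
    BYTE *pdo = pdataout;
    for (int i = 16; i--;)
        *pdo++ = 255;
}

/* Same effect via the library, leaving the code choice to the compiler. */
void foo_memset(void *pdataout)
{
    memset(pdataout, 0xFF, 16);
}
```

Whether the local pdo costs anything is up to the optimizer; with optimization enabled it typically just reuses the register or stack slot that already holds the parameter.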

Optimizing mask function with ARM SIMD instructions

I was wondering if you could help me use NEON intrinsics to optimize this mask function. I already tried auto-vectorization with the -O3 gcc flag, but the function ran slower than with -O2, which leaves auto-vectorization off. For some reason the assembly produced with -O3 is 1.5 times longer than the one produced with -O2.
void mask(unsigned int x, unsigned int y, uint32_t *s, uint32_t *m)
{
unsigned int ixy;
ixy = xsize * ysize;
while (ixy--)
*(s++) &= *(m++);
}
I probably have to use the following intrinsics:
vld1q_u32 // to load 4 integers from s and m
vandq_u32 // to execute logical and between the 4 integers from s and m
vst1q_u32 // to store them back into s
However, I don't know how to do it in the most optimal way. For instance, should I increase s and m by 4 after loading, ANDing and storing? I am quite new to NEON, so I would really appreciate some help.
I am using gcc 4.8.1 and compiling with the following command:
arm-linux-gnueabihf-gcc -mthumb -march=armv7-a -mtune=cortex-a9 -mcpu=cortex-a9 -mfloat-abi=hard -mfpu=neon -O3 -fprefetch-loop-arrays name.c -o name
Thanks in advance
I would probably do it like this. I've included 4x loop unrolling. Preloading the cache is always a good idea and can speed things up another 25%. Since there's not much processing going on (it's mostly spending time loading and storing), it's best to load lots of registers, then process them as it gives time for the data to actually load. It assumes the data is an even multiple of 16 elements.
void fmask(unsigned int x, unsigned int y, uint32_t *s, uint32_t *m)
{
unsigned int ixy;
uint32x4_t srcA,srcB,srcC,srcD;
uint32x4_t maskA,maskB,maskC,maskD;
ixy = xsize * ysize;
ixy /= 16; // process 16 at a time
while (ixy--)
{
__builtin_prefetch(&s[64]); // preload the cache
__builtin_prefetch(&m[64]);
srcA = vld1q_u32(&s[0]);
maskA = vld1q_u32(&m[0]);
srcB = vld1q_u32(&s[4]);
maskB = vld1q_u32(&m[4]);
srcC = vld1q_u32(&s[8]);
maskC = vld1q_u32(&m[8]);
srcD = vld1q_u32(&s[12]);
maskD = vld1q_u32(&m[12]);
srcA = vandq_u32(srcA, maskA);
srcB = vandq_u32(srcB, maskB);
srcC = vandq_u32(srcC, maskC);
srcD = vandq_u32(srcD, maskD);
vst1q_u32(&s[0], srcA);
vst1q_u32(&s[4], srcB);
vst1q_u32(&s[8], srcC);
vst1q_u32(&s[12], srcD);
s += 16;
m += 16;
}
}
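If the length is not guaranteed to be a multiple of 16, a scalar tail loop can handle the leftovers. The sketch below is a simpler, non-unrolled variant that takes the element count directly; mask_any_length is a hypothetical name. The NEON path is guarded by the compiler's predefined macro, so the same source also builds, with the scalar loop only, on other targets.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* s[i] &= m[i] for any n: vector body plus scalar tail. */
void mask_any_length(uint32_t *s, const uint32_t *m, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Process 4 elements per iteration while at least 4 remain. */
    for (; i + 4 <= n; i += 4) {
        uint32x4_t sv = vld1q_u32(&s[i]);
        uint32x4_t mv = vld1q_u32(&m[i]);
        vst1q_u32(&s[i], vandq_u32(sv, mv));
    }
#endif
    /* Remaining 0..3 elements (or all of them without NEON). */
    for (; i < n; i++)
        s[i] &= m[i];
}
```

The unrolled 16-element body from the routine above can replace the 4-element loop unchanged; only the tail handling is new.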
I would start with the simplest one and use it as a reference to compare future routines against.
A good rule of thumb is to calculate things as early as possible, not exactly when they are needed.
Instructions can take X cycles to execute, and their results are not always immediately ready, so scheduling is important.
As an example, a simple scheduling scheme for your case would be (pseudocode):
nn=n/4 // Assuming n is a multiple of 4
LOADI_S(0) // Load and immediately after increment pointer
LOADI_M(0) // Load and immediately after increment pointer
for( k=1; k<nn;k++){
AND_SM(k-1) // Inner op
LOADI_S(k) // Load and increment after
LOADI_M(k) // Load and increment after
STORE_S(k-1) // Store and increment after
}
AND_SM(nn-1)
STORE_S(nn-1) // Store. Not needed to increment
By moving these instructions out of the inner loop, we ensure that the operations inside it don't depend on the result of the immediately preceding operation.
This scheme can be extended further to make use of the time that would otherwise be lost waiting for the result of the previous operation.
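The scheduling scheme can be sketched in plain C as follows. This is a scalar stand-in: each uint32_t here plays the role of one 4-lane vector in a NEON version, and mask_pipelined is a hypothetical name. Whether the compiler keeps this ordering is not guaranteed, so treat it as an illustration of the structure only.

```c
#include <stddef.h>
#include <stdint.h>

/* Software-pipelined s[k] &= m[k]: the loads for step k are issued
 * before the store of step k-1, with a prologue and an epilogue. */
void mask_pipelined(uint32_t *s, const uint32_t *m, size_t n)
{
    if (n == 0)
        return;
    uint32_t sv = s[0];              /* prologue: LOADI_S(0) */
    uint32_t mv = m[0];              /*           LOADI_M(0) */
    for (size_t k = 1; k < n; k++) {
        uint32_t r = sv & mv;        /* AND_SM(k-1)  */
        sv = s[k];                   /* LOADI_S(k)   */
        mv = m[k];                   /* LOADI_M(k)   */
        s[k - 1] = r;                /* STORE_S(k-1) */
    }
    s[n - 1] = sv & mv;              /* epilogue: AND_SM, STORE_S */
}
```

The result is identical to the straightforward loop; only the order in which loads, ANDs and stores are issued changes.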
Also, since intrinsics still depend on the optimizer, check what the compiler does under different optimization options. I prefer inline assembly, which is not difficult for small routines and gives you more control.

Resources