Exit early on found in OpenCL

Exit early on found in OpenCL - opencl

I'm trying to write an OpenCL implementation of memchr to help me learn how OpenCL works. What I'm planning to do is to assign each work item a chunk of memory to search. Then, inside each work item, it loops through the chunk searching for the character.
Especially if the buffer is large, I don't want the other threads to keep searching after an occurrence has already been found (assume there is only one occurrence of the character in any given buffer).
What I'm stuck on is how does a work item indicate, both to the host and other threads, when it has found the character?
Thanks,

One way you could do this is to use a global flag variable. You atomically set it to 1 when you find the value and other threads will check on that value when they are doing work.
For example:
__kernel test(__global int* buffer, __global volatile int* flag)
{
int tid = get_global_id(0);
int sx = get_global_size(0);
int i = tid;
while(buffer[i] != 8) //Whatever value we're trying to find.
{
int stop = atomic_add(&flag, 0); //Read the atomic value
if(stop)
break;
i = i + sx;
}
atomic_xchg(&flag, 1); //Set the atomic value
}
This might add more overhead than by just running the whole kernel (unless you are doing a lot of work on every iteration). In addition, this method won't work if each thread is just checking a single value in the array. Each thread must have multiple iterations of work.
Finally, I've seen instances where writing to an atomic variable doesn't immediately commit, so you need to check to see if this code will deadlock on your system because the write isn't committing.

Related

Using vloadn (opencl) to load unallocated memory

I am using vloadn to load data and as a parameter I pass the range I want to read and it works, but I am wondering what's the behavior of vload4. If this might cause some unexpected issue or I am perfectly safe to do this. An example might be something like this:
__kernel void myKernel(__global float* data_ptr, int size)
{
float4 vec = vload4(0, data_ptr);
float sum = 0.f;
// data_ptr points to an array of 2 floats in global mem
if (size == 2) {
sum += vec.s1;
sum += vec.s0;
}
else if (size == 1) {
sum += vec.s0;
}
}
data_ptr is an array of 2 floats in global memory, but even though I am accessing only those 2 floats, I am loading 4 floats using vload4. The reason I am asking is that I want to use a single vloadn and decide afterwards how much of it I actually want to use and not to use vloadn based on size (e.g. for size==4 use vload4, for size==8 vload8 etc.

If it's still within data_ptr it will be fine; you don't have to use all the data you read. However, if you read past either end of the buffer that data_ptr points to you can have problems (memory read exception, for example, or some other device-dependent error). Note: Check the address alignment requirements for vload to see if you're allowed to read at any address or only multiple of with size.

Frama-c : Trouble understanding WP memory models

I'm looking for WP options/model that could allow me to prove basic C memory manipulations like :
memcpy : I've tried to prove this simple code :
struct header_src{
char t1;
char t2;
char t3;
char t4;
};
struct header_dest{
short t1;
short t2;
};
/*# requires 0<=n<=UINT_MAX;
# requires \valid(dest);
# requires \valid_read(src);
# assigns (dest)[0..n-1] \from (src)[0..n-1];
# assigns \result \from dest;
# ensures dest[0..n] == src[0..n];
# ensures \result == dest;
*/
void* Frama_C_memcpy(char *dest, const char *src, uint32_t n);
int main(void)
{
struct header_src p_header_src;
struct header_dest p_header_dest;
p_header_src.t1 = 'e';
p_header_src.t2 = 'b';
p_header_src.t3 = 'c';
p_header_src.t4 = 'd';
p_header_dest.t1 = 0x0000;
p_header_dest.t2 = 0x0000;
//# assert \valid(&p_header_dest);
Frama_C_memcpy((char*)&p_header_dest, (char*)&p_header_src, sizeof(struct header_src));
//# assert p_header_dest.t1 == 0x6265;
//# assert p_header_dest.t2 == 0x6463;
}
but the two last assert weren't verified by WP (with default prover Alt-Ergo). It can be proved thanks to Value analysis, but I mostly want to be able to prove the code not using abstract interpretation.
Cast pointer to int : Since I'm programming embedded code, I want to be able to specify something like:
#define MEMORY_ADDR 0x08000000
#define SOME_SIZE 10
struct some_struct {
uint8_t field1[SOME_SIZE];
uint32_t field2[SOME_SIZE];
}
// [...]
// some function body {
struct some_struct *p = (some_struct*)MEMORY_ADDR;
if(p == NULL) {
// Handle error
} else {
// Do something
}
// } body end
I've looked a little bit at WP's documentation and it seems that the version of frama-c that I use (Magnesium-20151002) has several memory model (Hoare, Typed , +cast, +ref, ...) but none of the given example were proved with any of the model above. It is explicitly said in the documentation that Typed model does not handle pointer-to-int casts. I've a lot of trouble to understand what's really going on under the hood with each wp-model. It would really help me if I was able to verify at least post-conditions of the memcpy function. Plus, I have seen this issue about void pointer that apparently are not very well handled by WP at least in the Magnesium version. I didn't tried another version of frama-c yet, but I think that newer version handle void pointer in a better way.
Thank you very much in advance for your suggestions !

memcpy
Reasoning about the result of memcpy (or Frama_C_memcpy) is out of range of the current WP plugin. The only memory model that would work in your case is Bytes (page 13 of the manual for Chlorine), but it is not implemented.
Independently, please note that your postcondition from Frama_C_memcpy is not what you want here. You are asserting the equality of the sets dest[0..n] and src[0..n]. First, you should stop at n-1. Second, and more importantly, this is far too weak, and is in fact not sufficient to prove the two assertions in the caller. What you want is a quantification on all bytes. See e.g. the predicate memcmp in Frama-C's stdlib, or the variant \forall int i; 0 <= i < n -> dest[i] == src[i];
By the way, this postcondition holds only if dest and src are properly separated, which your function does not require. Otherwise, you should write dest[i] == \at (src[i], Pre). And your requires are also too weak for another reason, as you only require the first character to be valid, not the n first ones.
Cast pointer to int
Basically, all current models implemented in WP are unable to reason on codes in which the memory is accessed with multiple incompatible types (through the use of unions or pointer casts). In some cases, such as Typed, the cast is detected syntactically, and a warning is issued to warn that the code cannot be analyzed. The Typed+Cast model is a variant of Typed in which some casts are accepted. The analysis is correct only if the pointer is re-cast into its original type before being used. The idea is to allow using some libc functions that require a cast to void*, but not much more.
Your example is again a bit different, because it is likely that MEMORY_ADDR is always addressed with type some_stuct. Would it be acceptable to change the code slightly, and change your function as taking a pointer to this type? This way, you would "hide" the cast to MEMORY_ADDR inside a function that would remain unproven.

I tried this example in the latest version of Frama-C (of course the format is modified a little bit).
for the memcpy case
Assertion 2 fails but assertion 3 is successfully proved (basically because the failure of assertion 2 leads to a False assumption, which proves everything).
So in fact both assertion cannot be proved, same as your problem.
This conclusion is sound because the memory models used in the wp plugin (as far as I know) has no assumption on the relation between fields in a struct, i.e. in header_src the first two fields are 8 bit chars, but they may not be nestedly organized in the physical memory like char[2]. Instead, there may be paddings between them (refer to wiki for detailed description). So when you try to copy bits in such a struct to another struct, Frama-C becomes completely confused and has no idea what you are doing.
As far as I am concerned, Frama-C does not support any approach to precisely control the memory layout, e.g. gcc's PACKED which forces the compiler to remove paddings.
I am facing the same problem, and the (not elegant at all) solution is, use arrays instead. Arrays are always nested, so if you try to copy a char[4] to a short[2], I think the assertion can be proved.
for the Cast pointer to int case
With memory model Typed+cast, the current version I am using (Chlorine-20180501) supports casting between pointers and uint64_t. You may want to try this version.
Moreover, it is strongly suggested to call Z3 and CVC4 through why3, whose performance is certainly better than Alt-Ergo.

OpenCL Programm freezes everytime private variables used

I got a kernel method, simplified looks like this.
__kernel void calculate (__global float *a, __global float *b, __global int *res) {
int workItem = get_global_id(0); // Syntax may not right, but you get the idea
int found = 0;
for (int i = 0; i<100000000; i++) {
float c = a[i]*3;
float d = b[i]*2;
if (c<d) {
found++;
}
}
res[workItem]= found;
}
So, nothing much but simple calculation and a very big loop, the problem is the programm freezes all the time when i run this code. I have to force reset the computer everytime this happens.
But if i make some changes, like this
if (true) {
found++;
}
Or
if (1<2) {
found++
}
Then the programm works like a charm, and very fast! So i wonder if is there any thing wrong with variables c and d ? I tried to use things like
__private float c= ..;
__private float d= ..;
It didnt work either.
I read the return code of every step while creating programm and kernel, so its not the problem.
What did I do wrong here ?

You have several issues here.
You are running a giant loop, presumably with a single work item, which is exactly what you want to avoid in OpenCL, as the entire concept of parallelism is lost. OpenCL will not magically make your code fast if you don't use it correctly.
Even if you are using multiple work items, you are not making use of the value you retrieve from get_global_id() except for output. But, as they're all using the same input, you'll get the same output for every single work item!
Work items and their associated global IDs are intended to allow you to partition your processing into discrete units, rather than one big monolithic loop. I suggest you look at some tutorials like this one to understand the concept better. Don't start writing your own code until you understand his.
As for why your program freezes your PC, I can only speculate without seeing your host code. Perhaps you are getting buffer overrun?

Is there a minimum string length for F() to be useful?

Is there a limit for short strings where using the F() macro brings more RAM overhead then saving?
For (a contrived) example:
Serial.print(F("\n"));
Serial.print(F("Hi"));
Serial.print(F("there!"));
Serial.print(F("How do you doyou how?"));
Would any one of those be more efficient without the F()?
I imagine it uses some RAM to iterate over the string and copy it from PROGMEM to RAM. I guess the question is: how much? Also, is heap fragmentation a concern here?
I'm looking at this purely from SRAM-conserving perspective.

From a purely SRAM-conserving perspective all of your examples are identical in that no SRAM is used. At run-time some RAM is used, but only momentarily on the stack. Keep in mind that calling println() (w/o any parameters) uses some stack/RAM.
For a single character it will take up less space in flash if a char is passed into print or println. For example:
Serial.print('\n');
The char will be in flash (not static RAM).
Using
Serial.print(F("\n"));
will create a string in flash memory that is two bytes long (newline char + null terminator) and will additionally pass a pointer to that string to print which is probably two bytes long.
Additionally at runtime, using the F macro will result in two fetches ('\n' and the null terminator) from flash. While fetches from flash are fast, passing in a char results in zero fetches from flash, which is a tiny bit faster.

I don't think there is any minimum size of the string to be useful. If you look at how the outputting is implemented in Print.cpp:
size_t Print::print(const __FlashStringHelper *ifsh)
{
PGM_P p = reinterpret_cast<PGM_P>(ifsh);
size_t n = 0;
while (1) {
unsigned char c = pgm_read_byte(p++);
if (c == 0) break;
n += write(c);
}
return n;
}
You can see from there that only one byte of RAM is used at a time (plus a couple of variables), as it pulls the string from PROGMEM a byte at a time. These are all on the stack so there is no ongoing overhead.
I imagine it uses some RAM to iterate over the string and copy it from PROGMEM to RAM. I guess the question is: how much?
No, it doesn't as I showed above. It outputs a byte at a time. There is no copying (in bulk) of the string into RAM first.
Also, is heap fragmentation a concern here?
No, the code does not use the heap.

Pointer Trouble

I was trying some basic pointer manipulation and have a issue i would like clarified. Here is the code snippet I am referring to
int arr[3] = {0};
*(arr+0) = 12;
*(arr+1) = 24;
*(arr+2) = 74;
*(arr+3) = 55;
cout<<*(arr+3)<<"\t"<<(long)(arr+3)<<endl;
//cout<<"Address of array arr : "<<arr<<endl;
cout<<(long)(arr+0)<<"\t"<<(long)(arr+1)<<"\t"<<(long)(arr+2)<<endl;;
for(int i=0;i<4;i++)
cout<<*(arr+i)<<"\t"<<i<<"\t"<<(long)(arr+i)<<endl;
//*(arr+3) = 55;
cout<<*(arr+3)<<endl<<endl;
My problem is:
When I try to acces arr+3 outside the for-loop , I get the desired value 55 printed. But when I try to access it through the for loop, I get some different value(3 in this case). After the for loop, it is printing the value as 4. Could someone explain to me what is happening? Thanks in advance..

You have created an array of size 3 and you are trying to access the 4th element. The outcome is therefore undefined.
Since you allocate the array in the stack, the first time you try to write the 4th element, you are actually writing beyond the space that was allocated for the stack. In Debug mode this will work, but in Release your program will probably crash.
The second time you are reading the value at the 4th place you are reading the value 4. This makes sense, as the compiler has allocated the stack space after the array for variable i, which after the loop has finished executing will have the value 4.

As array has been defined with 3 elements, data will be stored sequentially like 12,24,74. When you assign 55 for 4th element, it is stored somewhere else in memory, not sequentially. First time, Compiler prints it correctly, but then it is not able to handle memory so it prints garbage value.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex