OpenCL kernel fails to compile asking for address space qualifier - opencl

The following opencl code fails to compile.
typedef struct {
double d;
double* da;
long* la;
uint ui;
} MyStruct;
__kernel void MyKernel (__global MyStruct* s) {
}
The error message is as follows.
line 11: error: kernel pointer arguments must point to addrSpace global, local, or constant
__kernel void MyKernel (__global MyStruct* s) {
^
As you can see I have clearly qualified the argument with '__global' as the error suggests I should. What am I doing wrong and how can I resolve this error?
Obviously this happens during kernel compilation so I haven't posted my host code here as it doesn't even get further than this.
Thanks.

I think the problem is that you have pointers in your struct, which is not allowed. You cannot point to host memory from your kernel like that, so pointers in kernel argument structs don't make much sense. Variable-sized arrays are backed up in OpenCL by a cl_mem host object, and that counts for one whole argument, so as far as I know, you can only pass variable-sized arrays directly as a kernel argument (and adjust the number of work units accordingly, of course).
You might prefer to put size information in your struct and pull out the arrays as standalone kernel arguments.

Related

Frama-c : Trouble understanding WP memory models

I'm looking for WP options/model that could allow me to prove basic C memory manipulations like :
memcpy : I've tried to prove this simple code :
struct header_src{
char t1;
char t2;
char t3;
char t4;
};
struct header_dest{
short t1;
short t2;
};
/*# requires 0<=n<=UINT_MAX;
# requires \valid(dest);
# requires \valid_read(src);
# assigns (dest)[0..n-1] \from (src)[0..n-1];
# assigns \result \from dest;
# ensures dest[0..n] == src[0..n];
# ensures \result == dest;
*/
void* Frama_C_memcpy(char *dest, const char *src, uint32_t n);
int main(void)
{
struct header_src p_header_src;
struct header_dest p_header_dest;
p_header_src.t1 = 'e';
p_header_src.t2 = 'b';
p_header_src.t3 = 'c';
p_header_src.t4 = 'd';
p_header_dest.t1 = 0x0000;
p_header_dest.t2 = 0x0000;
//# assert \valid(&p_header_dest);
Frama_C_memcpy((char*)&p_header_dest, (char*)&p_header_src, sizeof(struct header_src));
//# assert p_header_dest.t1 == 0x6265;
//# assert p_header_dest.t2 == 0x6463;
}
but the two last assert weren't verified by WP (with default prover Alt-Ergo). It can be proved thanks to Value analysis, but I mostly want to be able to prove the code not using abstract interpretation.
Cast pointer to int : Since I'm programming embedded code, I want to be able to specify something like:
#define MEMORY_ADDR 0x08000000
#define SOME_SIZE 10
struct some_struct {
uint8_t field1[SOME_SIZE];
uint32_t field2[SOME_SIZE];
}
// [...]
// some function body {
struct some_struct *p = (some_struct*)MEMORY_ADDR;
if(p == NULL) {
// Handle error
} else {
// Do something
}
// } body end
I've looked a little bit at WP's documentation and it seems that the version of frama-c that I use (Magnesium-20151002) has several memory model (Hoare, Typed , +cast, +ref, ...) but none of the given example were proved with any of the model above. It is explicitly said in the documentation that Typed model does not handle pointer-to-int casts. I've a lot of trouble to understand what's really going on under the hood with each wp-model. It would really help me if I was able to verify at least post-conditions of the memcpy function. Plus, I have seen this issue about void pointer that apparently are not very well handled by WP at least in the Magnesium version. I didn't tried another version of frama-c yet, but I think that newer version handle void pointer in a better way.
Thank you very much in advance for your suggestions !
memcpy
Reasoning about the result of memcpy (or Frama_C_memcpy) is out of range of the current WP plugin. The only memory model that would work in your case is Bytes (page 13 of the manual for Chlorine), but it is not implemented.
Independently, please note that your postcondition from Frama_C_memcpy is not what you want here. You are asserting the equality of the sets dest[0..n] and src[0..n]. First, you should stop at n-1. Second, and more importantly, this is far too weak, and is in fact not sufficient to prove the two assertions in the caller. What you want is a quantification on all bytes. See e.g. the predicate memcmp in Frama-C's stdlib, or the variant \forall int i; 0 <= i < n -> dest[i] == src[i];
By the way, this postcondition holds only if dest and src are properly separated, which your function does not require. Otherwise, you should write dest[i] == \at (src[i], Pre). And your requires are also too weak for another reason, as you only require the first character to be valid, not the n first ones.
Cast pointer to int
Basically, all current models implemented in WP are unable to reason on codes in which the memory is accessed with multiple incompatible types (through the use of unions or pointer casts). In some cases, such as Typed, the cast is detected syntactically, and a warning is issued to warn that the code cannot be analyzed. The Typed+Cast model is a variant of Typed in which some casts are accepted. The analysis is correct only if the pointer is re-cast into its original type before being used. The idea is to allow using some libc functions that require a cast to void*, but not much more.
Your example is again a bit different, because it is likely that MEMORY_ADDR is always addressed with type some_stuct. Would it be acceptable to change the code slightly, and change your function as taking a pointer to this type? This way, you would "hide" the cast to MEMORY_ADDR inside a function that would remain unproven.
I tried this example in the latest version of Frama-C (of course the format is modified a little bit).
for the memcpy case
Assertion 2 fails but assertion 3 is successfully proved (basically because the failure of assertion 2 leads to a False assumption, which proves everything).
So in fact both assertion cannot be proved, same as your problem.
This conclusion is sound because the memory models used in the wp plugin (as far as I know) has no assumption on the relation between fields in a struct, i.e. in header_src the first two fields are 8 bit chars, but they may not be nestedly organized in the physical memory like char[2]. Instead, there may be paddings between them (refer to wiki for detailed description). So when you try to copy bits in such a struct to another struct, Frama-C becomes completely confused and has no idea what you are doing.
As far as I am concerned, Frama-C does not support any approach to precisely control the memory layout, e.g. gcc's PACKED which forces the compiler to remove paddings.
I am facing the same problem, and the (not elegant at all) solution is, use arrays instead. Arrays are always nested, so if you try to copy a char[4] to a short[2], I think the assertion can be proved.
for the Cast pointer to int case
With memory model Typed+cast, the current version I am using (Chlorine-20180501) supports casting between pointers and uint64_t. You may want to try this version.
Moreover, it is strongly suggested to call Z3 and CVC4 through why3, whose performance is certainly better than Alt-Ergo.

Local memory using C++ Wrappers

I wish to use local work groups for my kernels, but I'm having some issues passing the 'NULL' parameters to my kernels. I hope to know how to pass these parameters using the methods that I'm using which I will show below, as opposed to setArg which I saw here: How to declare local memory in OpenCL?
I have the following host code for my kernel:
initialized in a .h file:
std::shared_ptr<cl::make_kernel<cl::Buffer, cl::Buffer>> setInputKernel;
in host code:
this->setInputKernel.reset(new cl::make_kernel<cl::Buffer, cl::Buffer>(program, "setInputs"));
enqueue kernel code:
(*setInputKernel)(cl::EnqueueArgs(*queue, cl::NDRange(1000),cl::NDRange(1000)),
cl::Buffer, cl::Buffer);
kernel code:
kernel void setInputs(global float* restrict inputArr, global float* restrict inputs)
I have already set the appropriate sizes and setting for my local work group parameters. However, I did not successfully pass the data into the kernel.
The kernel with the local work group updates:
kernel void setInputs(global float* restrict inputArr, global float*
restrict inputs, local float* inputArrLoc, local float* inputsLoc)
I had tried to change my code accordingly by using NULL or cl::Buffer for the input params of the kernels, but didn't work:
std::shared_ptr<cl::make_kernel<cl::Buffer, cl::Buffer, NULL, NULL>> setInputKernel;
std::shared_ptr<cl::make_kernel<cl::Buffer, cl::Buffer, cl::Buffer, cl::Buffer>> setInputKernel;
with the first attempt giving me compiler issues saying that the function expects a value while I did not give one, and the second attempt returning clSetKernelArg error when I try to run the kernel. In both examples, I had ensured that all the parameters for the headers and host files were consistent.
I also tried to just put NULL behind my cl::Buffers when I enqueue the kernel, but this returns an error telling me that there is no function for call.
How do I pass parameters to my kernel in my example?
There is a LocalSpaceArg type and Local helper function to do this.
The type of your kernel would be this:
cl::make_kernel<cl::Buffer, cl::Buffer, cl::LocalSpaceArg, cl::LocalSpaceArg>
You would then specify the size of the local memory allocations when you enqueue the kernel by using cl::Local(size) (where size is the number of bytes you wish to allocate).

Julia ccall outb - Problems with libc

I run the following ccall's:
status = ccall((:ioperm, "libc"), Int32, (Uint, Uint, Int32), 0x378, 5, 1)
ccall((:outb, "libc"), Void, (Uint8, Uint16), 0x00, 0x378)
After the second ccall I receive the following Error message:
ERROR: ccall: could not find function outb in library libc
in anonymous at no file
in include at ./boot.jl:245
in include_from_node1 at loading.jl:128
in process_options at ./client.jl:285
After some research and messing around I found the following information:
ioperm is in libc, but outb is not
However, both ioperm and outb are defined in the same header file <sys/io.h>
An equivalent version of C code compiles and runs smoothly.
outb in glibc, however on the system glibc is defined as libc
Same problem with full path names /lib/x86_64-linux-gnu/libc.so.6
EDIT:
Thanks for the insight #Employed Russian! I did not look closely enough to realize the extern declaration. Now, all of my above notes make total sense!
Great, we found that ioperm is a libc function that is declared in <sys/io.h>, and that outb is not in libc, but is defined in <sys/io.h> as a volatile assembly instruction.
Which library, or file path should I use?
Implementation of ccall.
However, both ioperm and outb are defined in the same header file <sys/io.h>
By "defined" you actually mean "declared". They are different. On my system:
extern int ioperm (unsigned long int __from, unsigned long int __num,
int __turn_on) __attribute__ ((__nothrow__ , __leaf__));
static __inline void
outb (unsigned char __value, unsigned short int __port)
{
__asm__ __volatile__ ("outb %b0,%w1": :"a" (__value), "Nd" (__port));
}
It should now be obvious why you can call ioperm but not outb.
Update 1
I am still lost as to how to correct the error.
You can't import outb from libc. You would have to provide your own native library, e.g.
void my_outb(unsigned char value, unsigned short port) {
outb(value, port);
}
and import my_outb from it. For symmetry, you should probably implement my_ioperm the same way, so you are importing both functions from the same native library.
Update 2
Making a library worked, but in terms of performance it is horrible.
I guess that's why the original is implemented as an inline function: you are only executing a single outb instruction, so the overhead of a function call is significant.
Unoptimized python does x5 better.
Probably by having that same outb instruction inlined into it.
Do you know if outb exist in some other library, not in libc
That is not going to help: you will still have a function call overhead. I am guessing that when you call the imported function from Julia, you probably execute a dlopen and dlsym call, which would impose an overhead of additional several 100s of instructions.
There is probably a way to "bind" the function dynamically once, and then use it repeatedly to make the call (thus avoiding repeated dlopen and dlsym). That should help.

How to pass char pointer into opencl kernel?

I am trying to pass a char pointer into the kernel function of opencl as
char *rp=(char*)malloc(something);
ciErr=clSetKernelArg(ckKernel,0,sizeof(cl_char* ),(char *)&rp)
and my kernel is as
__kernel void subFilter(char *rp)
{
do something
}
When I am running the kernel I am getting
error -48 in clsetkernelargs 1
Also, I tried to modify the kernel as
__kernel void subFilter(__global char *rp)
{
do something
}
I got error as
error -38 in clsetkernelargs 1
which says invalid mem object .
i just want to access the memory location pointed by the rp in the kernel.
Any help would be of great help.
Thnaks,
Piyush
Any arrays and memory objects that you use in an OpenCL kernel needed to be allocated via the OpenCL API (e.g. using clCreateBuffer). This is because the host and device don't always share the same physical memory. A pointer to data that is allocated on the host (via malloc) means absolutely nothing to a discrete GPU for example.
To pass an array of characters to an OpenCL kernel, you should write something along the lines of:
char *h_rp = (char*)malloc(length);
cl_mem d_rp = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, length, h_rp, &err);
err = clSetKernelArg(ckKernel, 0, sizeof(cl_mem), &d_rp)
and declare the argument with the __global (or __constant) qualifier in your kernel. You can then copy the data back to the host with clEnqueueReadBuffer.
If you do know that host and device share the same physical memory, then you can allocate memory that is visible to both host and device by creating a buffer with the CL_MEM_ALLOC_HOST_PTR flag, and using clEnqueueMapMemObject when you wish to access the data from the host. The new shared-virtual-memory (SVM) features of OpenCL 2.0 also improve the way that you can share buffers between host and device on unified-memory architectures.

Passing a pointer to device memory between classes in CUDA

I would appreciate some help involving CUDA device memory pointers. Basically I want to split my CUDA kernel code into multiple files for readability and because it is a large program. So what I want to do is be able to pass the same device memory pointers to multiple CUDA kernels, not simultaneously. Below is a rough example of what I need
//random.h
class random{
public:
int* dev_pointer_numbers;
};
so the object simply needs to store the pointer to device memory
//random_kernel.cu
__global__ void doSomething(int *values){
//do some processing}
extern "C" init_memory(int *devPtr,int *host_memory,int arraysize)
{
cudaMalloc(&devPtr,arraysize*sizeof(int));
cudaMemcpy(devPtr,host_memory,arraysize*sizeof(int),cudaMemcpyHostToDevice);
}
extern "C" runKernel(int *devPtr){
doSomething<<<1,1>>>(devPtr);
}
and the main file:
//main.cpp
//ignoring all the details etc
random rnd;
void CUDA(int *hostArray)
{
init_memory(rnd.dev_pointer_numbers,hostArray,10);
runKernel(rnd.dev_pointer_numbers);
}
I understand that when I run the kernel code with the object pointer it isnt mapped in device memory so thats why the kernel code fails. What I want to know is how can I store to the pointer to a particular block in device memory in my main file so that it can be reused amongst other cuda kernel files?
You're losing your pointer!
Check out your init_memory function:
init_memory(int *devPtr,int *host_memory,int arraysize)
{
cudaMalloc(&devPtr,arraysize*sizeof(int));
cudaMemcpy(devPtr,host_memory,arraysize*sizeof(int),cudaMemcpyHostToDevice);
}
So you pass in a pointer, at which point you have a local copy named devPtr. Then you call cudaMalloc() with the address of the local copy of the pointer. When the function returns the local copy (on the stack) is destroyed, so you have lost the pointer.
Instead try this:
init_memory(int **devPtr,int *host_memory,int arraysize)
{
cudaMalloc(devPtr,arraysize*sizeof(int));
cudaMemcpy(*devPtr,host_memory,arraysize*sizeof(int),cudaMemcpyHostToDevice);
}
...
init_memory(&rnd.dev_pointer_numbers,hostArray,10);
As a side note, consider removing the extern "C", since you're calling this from C++ (main.cpp) there's no point and it just clutters your code.

Resources