I would appreciate some help with CUDA device memory pointers. I want to split my CUDA kernel code into multiple files, both for readability and because the program is large. To do that, I need to pass the same device memory pointers to multiple CUDA kernels (not simultaneously). Below is a rough example of what I need:
//random.h
class random{
public:
int* dev_pointer_numbers;
};
So the object simply needs to store the pointer to device memory.
//random_kernel.cu
__global__ void doSomething(int *values){
//do some processing
}
extern "C" void init_memory(int *devPtr,int *host_memory,int arraysize)
{
cudaMalloc(&devPtr,arraysize*sizeof(int));
cudaMemcpy(devPtr,host_memory,arraysize*sizeof(int),cudaMemcpyHostToDevice);
}
extern "C" void runKernel(int *devPtr){
doSomething<<<1,1>>>(devPtr);
}
and the main file:
//main.cpp
//ignoring all the details etc
random rnd;
void CUDA(int *hostArray)
{
init_memory(rnd.dev_pointer_numbers,hostArray,10);
runKernel(rnd.dev_pointer_numbers);
}
I understand that when I run the kernel with the object's pointer, it isn't mapped in device memory, so that's why the kernel fails. What I want to know is: how can I store the pointer to a particular block of device memory in my main file so that it can be reused by other CUDA kernel files?
You're losing your pointer!
Check out your init_memory function:
void init_memory(int *devPtr,int *host_memory,int arraysize)
{
cudaMalloc(&devPtr,arraysize*sizeof(int));
cudaMemcpy(devPtr,host_memory,arraysize*sizeof(int),cudaMemcpyHostToDevice);
}
You pass in a pointer by value, so inside the function devPtr is a local copy. You then call cudaMalloc() with the address of that local copy, so the device address is written into the copy and not into rnd.dev_pointer_numbers. When the function returns, the local copy (on the stack) is destroyed, so you have lost the pointer.
Instead try this:
void init_memory(int **devPtr,int *host_memory,int arraysize)
{
cudaMalloc(devPtr,arraysize*sizeof(int));
cudaMemcpy(*devPtr,host_memory,arraysize*sizeof(int),cudaMemcpyHostToDevice);
}
...
init_memory(&rnd.dev_pointer_numbers,hostArray,10);
As a side note, consider removing the extern "C". Since you're calling this from C++ (main.cpp), there's no point, and it just clutters your code.
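The lost-pointer problem described above is not specific to CUDA, so it can be reproduced on the host. The following is only a sketch in plain C, with malloc standing in for cudaMalloc and memcpy standing in for cudaMemcpy; the names init_by_value and init_by_address are made up for illustration and are not part of the original code.

```c
#include <stdlib.h>
#include <string.h>

/* Broken variant: buf is a copy of the caller's pointer, so the new
   allocation is stored in the copy and the caller never sees it. */
void init_by_value(int *buf, const int *src, size_t n)
{
    buf = malloc(n * sizeof *buf);      /* only the local copy changes */
    if (buf != NULL) {
        memcpy(buf, src, n * sizeof *buf);
    }
    free(buf);  /* released here only so the sketch does not leak */
}

/* Fixed variant: the caller passes the address of its own pointer,
   which is the same shape as the int** version of init_memory. */
int init_by_address(int **buf, const int *src, size_t n)
{
    *buf = malloc(n * sizeof **buf);    /* writes through to the caller */
    if (*buf == NULL) {
        return -1;
    }
    memcpy(*buf, src, n * sizeof **buf);
    return 0;
}
```

After init_by_value(p, src, n) the caller's p is unchanged, whereas after init_by_address(&p, src, n) it points at the copied data, which is what the kernels need to receive.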
I wish to use local work groups for my kernels, but I'm having some issues passing the 'NULL' parameters to my kernels. I would like to know how to pass these parameters using the approach I show below, as opposed to setArg, which I saw here: How to declare local memory in OpenCL?
I have the following host code for my kernel:
initialized in a .h file:
std::shared_ptr<cl::make_kernel<cl::Buffer, cl::Buffer>> setInputKernel;
in host code:
this->setInputKernel.reset(new cl::make_kernel<cl::Buffer, cl::Buffer>(program, "setInputs"));
enqueue kernel code:
(*setInputKernel)(cl::EnqueueArgs(*queue, cl::NDRange(1000),cl::NDRange(1000)),
cl::Buffer, cl::Buffer);
kernel code:
kernel void setInputs(global float* restrict inputArr, global float* restrict inputs)
I have already set the appropriate sizes and settings for my local work group parameters. However, I have not managed to pass the data into the kernel.
The kernel with the local work group updates:
kernel void setInputs(global float* restrict inputArr, global float*
restrict inputs, local float* inputArrLoc, local float* inputsLoc)
I tried to change my code accordingly by using NULL or cl::Buffer for the input parameters of the kernel, but neither worked:
std::shared_ptr<cl::make_kernel<cl::Buffer, cl::Buffer, NULL, NULL>> setInputKernel;
std::shared_ptr<cl::make_kernel<cl::Buffer, cl::Buffer, cl::Buffer, cl::Buffer>> setInputKernel;
The first attempt gives a compiler error saying that the function expects a value that I did not give, and the second attempt returns a clSetKernelArg error when I try to run the kernel. In both cases, I made sure that all the parameters in the headers and host files were consistent.
I also tried putting NULL after my cl::Buffers when I enqueue the kernel, but this gives an error telling me that there is no matching function for the call.
How do I pass parameters to my kernel in my example?
There is a cl::LocalSpaceArg type and a cl::Local helper function to do this.
The type of your kernel would be this:
cl::make_kernel<cl::Buffer, cl::Buffer, cl::LocalSpaceArg, cl::LocalSpaceArg>
You would then specify the size of the local memory allocations when you enqueue the kernel by using cl::Local(size) (where size is the number of bytes you wish to allocate).
I run the following ccall's:
status = ccall((:ioperm, "libc"), Int32, (Uint, Uint, Int32), 0x378, 5, 1)
ccall((:outb, "libc"), Void, (Uint8, Uint16), 0x00, 0x378)
After the second ccall I receive the following Error message:
ERROR: ccall: could not find function outb in library libc
in anonymous at no file
in include at ./boot.jl:245
in include_from_node1 at loading.jl:128
in process_options at ./client.jl:285
After some research and messing around I found the following information:
ioperm is in libc, but outb is not
However, both ioperm and outb are defined in the same header file <sys/io.h>
An equivalent version of C code compiles and runs smoothly.
outb is in glibc; however, on this system glibc is defined as libc
Same problem with full path names /lib/x86_64-linux-gnu/libc.so.6
EDIT:
Thanks for the insight @Employed Russian! I did not look closely enough to notice the extern declaration. Now, all of my above notes make total sense!
Great, we found that ioperm is a libc function that is declared in <sys/io.h>, and that outb is not in libc, but is defined in <sys/io.h> as a volatile assembly instruction.
Which library, or file path should I use?
Implementation of ccall.
However, both ioperm and outb are defined in the same header file <sys/io.h>
By "defined" you actually mean "declared". They are different. On my system:
extern int ioperm (unsigned long int __from, unsigned long int __num,
int __turn_on) __attribute__ ((__nothrow__ , __leaf__));
static __inline void
outb (unsigned char __value, unsigned short int __port)
{
__asm__ __volatile__ ("outb %b0,%w1": :"a" (__value), "Nd" (__port));
}
It should now be obvious why you can call ioperm but not outb.
Update 1
I am still lost as to how to correct the error.
You can't import outb from libc. You would have to provide your own native library, e.g.
#include <sys/io.h>  /* declares ioperm() and defines the inline outb() */
void my_outb(unsigned char value, unsigned short port) {
outb(value, port);
}
and import my_outb from it. For symmetry, you should probably implement my_ioperm the same way, so you are importing both functions from the same native library.
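Why the wrapper is needed can be sketched without touching any I/O ports. In the example below, port_value is a hypothetical stand-in for a header-only function like outb (it only packs its arguments into an integer), and my_port_value plays the role of my_outb. Both names are assumptions made for illustration.

```c
/* Stand-in for a header-only function such as outb: it has internal
   linkage, so a shared library built from this file exports no
   symbol named port_value that a loader could find. */
static inline unsigned int port_value(unsigned char value, unsigned short port)
{
    return ((unsigned int)port << 8) | value;
}

/* Ordinary function with external linkage: this is the symbol a
   foreign-function interface can look up by name. */
unsigned int my_port_value(unsigned char value, unsigned short port)
{
    return port_value(value, port);
}
```

Built as a shared library, only my_port_value would appear in the exported symbol table, which mirrors why ioperm can be found in libc while outb cannot.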
Update 2
Making a library worked, but in terms of performance it is horrible.
I guess that's why the original is implemented as an inline function: you are only executing a single outb instruction, so the overhead of a function call is significant.
Unoptimized python does x5 better.
Probably by having that same outb instruction inlined into it.
Do you know if outb exist in some other library, not in libc
That is not going to help: you will still have the function call overhead. I am guessing that when you call the imported function from Julia, you probably execute a dlopen and dlsym call each time, which would add an overhead of several hundred instructions.
There is probably a way to "bind" the function dynamically once, and then use it repeatedly to make the call (thus avoiding repeated dlopen and dlsym). That should help.
I'll use a simple specific example to illustrate what I'm trying to do.
file main.c:
#include <stdio.h>
unsigned int X;
int main()
{
printf("&X = %p\r\n", (void *)&X);
return 0;
}
I want to know if it's possible (using a linker-script/gcc options) to manually specify an address for X at compile/link time, because I know it lies somewhere in memory, outside my executable.
I only want to know if this is possible, I know I can use a pointer (i.e. unsigned int*) to access a specific memory location (r/w) but that's not what I'm after.
What I'm after is making GCC generate code in which all accesses to global variables/static function variables are either done through a level of indirection, i.e. through a pointer (-fPIC not good enough because static global vars are not accessed via GOT) or their addresses can be manually specified (at link/compile time).
Thank you
What I'm after is making GCC generate code in which all accesses to
global variables/static function variables … their addresses can be
manually specified (at link/compile time).
You can specify the addresses of the .bss and .data sections (which contain the uninitialized and initialized variables respectively) with linker commands. The relative placement of the variables in the sections is up to the compiler/linker.
If you need only individual variables to be placed, this can be done by declaring them extern and specifying their addresses in a file, e.g. addresses.ld:
X = 0x12345678;
(note: spaces around = needed), which is added to the compiler/linker arguments:
cc main.c addresses.ld
I am trying to pass a char pointer into the kernel function of opencl as
char *rp=(char*)malloc(something);
ciErr=clSetKernelArg(ckKernel,0,sizeof(cl_char* ),(char *)&rp)
and my kernel is as
__kernel void subFilter(char *rp)
{
do something
}
When I am running the kernel I am getting
error -48 in clsetkernelargs 1
Also, I tried to modify the kernel as
__kernel void subFilter(__global char *rp)
{
do something
}
I got error as
error -38 in clsetkernelargs 1
which says invalid mem object.
I just want to access the memory location pointed to by rp in the kernel.
Any help would be greatly appreciated.
Thanks,
Piyush
Any arrays and memory objects that you use in an OpenCL kernel need to be allocated via the OpenCL API (e.g. using clCreateBuffer). This is because the host and device don't always share the same physical memory. A pointer to data that is allocated on the host (via malloc) means absolutely nothing to a discrete GPU, for example.
To pass an array of characters to an OpenCL kernel, you should write something along the lines of:
char *h_rp = (char*)malloc(length);
cl_mem d_rp = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, length, h_rp, &err);
err = clSetKernelArg(ckKernel, 0, sizeof(cl_mem), &d_rp);
and declare the argument with the __global (or __constant) qualifier in your kernel. You can then copy the data back to the host with clEnqueueReadBuffer.
If you do know that host and device share the same physical memory, then you can allocate memory that is visible to both host and device by creating a buffer with the CL_MEM_ALLOC_HOST_PTR flag, and using clEnqueueMapBuffer when you wish to access the data from the host. The new shared-virtual-memory (SVM) features of OpenCL 2.0 also improve the way that you can share buffers between host and device on unified-memory architectures.
The following opencl code fails to compile.
typedef struct {
double d;
double* da;
long* la;
uint ui;
} MyStruct;
__kernel void MyKernel (__global MyStruct* s) {
}
The error message is as follows.
line 11: error: kernel pointer arguments must point to addrSpace global, local, or constant
__kernel void MyKernel (__global MyStruct* s) {
^
As you can see I have clearly qualified the argument with '__global' as the error suggests I should. What am I doing wrong and how can I resolve this error?
Obviously this happens during kernel compilation so I haven't posted my host code here as it doesn't even get further than this.
Thanks.
I think the problem is that you have pointers in your struct, which is not allowed. You cannot point to host memory from your kernel like that, so pointers in kernel argument structs don't make much sense. Variable-sized arrays are backed in OpenCL by a cl_mem object on the host, and each one counts as a whole argument, so as far as I know, you can only pass variable-sized arrays directly as kernel arguments (and adjust the number of work units accordingly, of course).
You might prefer to put size information in your struct and pull out the arrays as standalone kernel arguments.
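One way to read that suggestion is sketched below in plain C, so that it compiles without an OpenCL toolchain. The names MyParams, da_len, la_len and my_kernel_body are assumptions made for illustration; in a real kernel the function would be declared __kernel and the three pointers would carry the __global qualifier. The summation is only a placeholder for the real work.

```c
#include <stddef.h>

/* Pointer-free parameter block: it holds only scalars and sizes, so
   it can be copied into a device buffer byte for byte. */
typedef struct {
    double d;
    unsigned int ui;
    unsigned int da_len;  /* number of elements in da */
    unsigned int la_len;  /* number of elements in la */
} MyParams;

/* Host-side stand-in for the kernel body. The arrays that used to be
   struct members are now separate arguments, one per buffer. */
double my_kernel_body(const MyParams *s, const double *da, const long *la)
{
    double acc = s->d;
    for (size_t i = 0; i < s->da_len; ++i) {
        acc += da[i];
    }
    for (size_t i = 0; i < s->la_len; ++i) {
        acc += (double)la[i];
    }
    return acc;
}
```

On the host side, each of the three arguments would then be backed by its own buffer and set with its own clSetKernelArg call.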