Can I send two-dimensional data to the device? If yes, how can I do that? That is, how should the buffer memory be declared? And how can I fetch and use these values in the kernel function on the device?
Please reply ASAP.
Sure.
A cl_mem object cannot contain other cl_mem objects. Thus, it is not possible to use "2D" data in the sense of a buffer of buffers in OpenCL. (In CUDA, this is possible, because the "buffers" there are only pointers to device memory.)
Usually, you can convert your data into one large cl_mem object and access it appropriately in the kernel:
__kernel void compute(__global float *data2D, int sizeX, int sizeY)
{
    int ix = get_global_id(0);
    int iy = get_global_id(1);
    int index = ix + iy * sizeX;
    float element = data2D[index];
    ....
}
Let's say you have a 2D buffer on the C++ side,
a buffer a of type float[2048][2048].
Then you get the address of its first element with
float *address = &a[0][0];
Then you use that address for your cl_mem object.
You can use the heap too!
float (*a)[2048] = new float[2048][2048];
....
....
float *address = &a[0][0];
Your OpenCL-side access to this area must match the C++-side layout exactly. In languages other than C++, you need to know whether your matrices are row-major or column-major, and whether they are arrays of arrays or arrays of objects (as in Java), before working with them. If your matrix is not contiguous in memory, this can fail.
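When the data is an array of separately allocated rows (the non-contiguous case), it can be packed into one contiguous block first. Here is a minimal sketch in plain C, assuming row-major layout and the same ix + iy * sizeX indexing as the kernel. The names flat_index and pack_rows are made up for illustration and are not part of OpenCL:

```c
#include <assert.h>
#include <stddef.h>

/* Same mapping as in the kernel: ix is the column, iy is the row. */
static size_t flat_index(size_t ix, size_t iy, size_t sizeX)
{
    return ix + iy * sizeX;
}

/* Copy sizeY rows of sizeX floats each (hypothetical array-of-row-pointers
   layout) into one contiguous block that can back a single buffer. */
static void pack_rows(const float *const *rows, size_t sizeX, size_t sizeY,
                      float *flat)
{
    assert(rows != NULL && flat != NULL);
    for (size_t iy = 0; iy < sizeY; ++iy)
        for (size_t ix = 0; ix < sizeX; ++ix)
            flat[flat_index(ix, iy, sizeX)] = rows[iy][ix];
}
```

The packed block holds sizeX * sizeY floats, so that is the element count to use when sizing the device buffer.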
There are functions to write host data to OpenCL buffers and to read it back (clEnqueueWriteBuffer and clEnqueueReadBuffer in the C API). Their structure and wrappers can change from version to version, and with the language being used.
Here is some sample code:
__kernel void my_kernel(__global float* src,
                        __global float* dst){
    float4 a = vload4(0,src);
    //do something to a
    ...
    vstore4(a,0,dst);
}
According to the OpenCL 1.2 reference, the addresses of the global buffers src and dst must be 4-byte aligned when using vloadn and vstoren, or the results are undefined. My question is whether OpenCL automatically aligns the global device address when clCreateBuffer is called. If not, how can I ensure proper alignment? (Also, what about local memory objects?)
Refer to the data types section of the OpenCL specification: the OpenCL compiler is responsible for aligning data items to the alignment required by the data type. So I think the answer is basically yes.
Buffers are surely aligned to a boundary bigger than 4 bytes, unless you are using CL_MEM_USE_HOST_PTR.
By the way: In your code it could be better to declare the parameters as float4* instead of using vload4 and vstore4.
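With CL_MEM_USE_HOST_PTR the alignment depends on the host pointer you supply, so it can be worth checking on the host before creating the buffer. Here is a minimal sketch in plain C. The helper name is_aligned is made up for illustration and is not an OpenCL function:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if ptr is a multiple of alignment bytes, 0 otherwise.
   alignment must be a non-zero power of two. */
static int is_aligned(const void *ptr, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)ptr & (alignment - 1)) == 0;
}
```

For example, is_aligned(host_ptr, 4) covers the requirement for vload4 on a float pointer, while a float4* parameter needs is_aligned(host_ptr, 16), because float4 is a 16-byte aligned type.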
Recently, while trying out CUDA programming, I wanted to send a vector to GPU memory. Someone told me that I can use thrust::device_vector and thrust::host_vector. I also read the documentation, but I still don't know how to pass such a vector to the kernel function.
My code is as follows:
thrust::device_vector<int> dev_firetime[1000];
__global__ void computeCurrent(thrust::device_vector<int> d_ftime)
{
    int idx = blockDim.x*blockIdx.x + threadIdx.x;
    printf("ftime = %d\n", d_ftime[idx]);
}
In short, I don't know how to pass the vector to the kernel function. If you know how, please tell me. Also, is there a better way to accomplish the same thing?
Thanks very much!
Thrust device vectors cannot be passed directly to CUDA kernels. You need to pass a pointer to the underlying device memory to the kernel. This can be done like this:
__global__ void computeCurrent(int* d_ftime)
{
    int idx = blockDim.x*blockIdx.x + threadIdx.x;
    printf("ftime = %d\n", d_ftime[idx]);
}
thrust::device_vector<int> dev_firetime(1000);
// raw_pointer_cast deduces the pointer type from its argument
int* d_ftime = thrust::raw_pointer_cast(dev_firetime.data());
computeCurrent<<<....>>>(d_ftime);
If you have an array of vectors, you need to do something like what is described here.
I'm following the example here to create a variable-length local memory array.
The kernel signature is something like this:
__kernel void foo(__global float4* ex_buffer,
                  int ex_int,
                  __local void *local_var)
Then I call clSetKernelArg for the local memory kernel argument as follows:
clSetKernelArg(*kern, 2, sizeof(char) * MaxSharedMem, NULL);
Where MaxSharedMem is set from querying CL_DEVICE_LOCAL_MEM_SIZE.
Then inside the kernel I split up the allocated local memory into several arrays and other data structures and use them as I see fit. All of this works fine with AMD (GPU and CPU) and Intel devices. However, on Nvidia, I get the error CL_INVALID_COMMAND_QUEUE when I enqueue this kernel and then run clFinish on the queue.
This is a simple kernel that generates the mentioned error (local work size is 32):
__kernel
void s_Kernel(const unsigned int N, __local void *shared_mem_block )
{
    const ushort thread_id = get_local_id(0);
    __local double *foo = shared_mem_block;
    __local ushort *bar = (__local ushort *) &(foo[1000]);
    foo[thread_id] = 0.;
    bar[thread_id] = 0;
}
The kernel runs fine if I allocate the same arrays and data structures in local memory statically. Could somebody provide an explanation for this behavior, and/or workarounds?
For those interested, I finally received an explanation from Nvidia. When the chunk of shared memory is passed in via a void pointer, the actual alignment does not match the expected alignment for a pointer to double (8-byte aligned). The GPU device throws an exception due to the misalignment.
As one of the comments pointed out, a way to circumvent the problem is to have the kernel parameter be a pointer to something that the compiler would properly align to at least 8 bytes (double, ulong, etc).
Ideally, the compiler would take responsibility for any alignment issues specific to the device, but because there is an implicit pointer cast in the little kernel featured in my question, I think it gets confused.
Once the memory is 8-byte aligned, a cast to a pointer type that assumes a shorter alignment (e.g. ushort) works without issues. So, if you're chaining the memory allocation like I'm doing, and the pointers are to different types, make sure to have the pointer to the largest type in the kernel signature.
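When several arrays of different types are chained inside one block like this, each array's byte offset can be rounded up to that type's alignment. Here is a minimal sketch in plain C of the offset arithmetic only. The helper name align_up is made up for illustration, and the sketch assumes the block itself starts at an address aligned for the largest type:

```c
#include <assert.h>
#include <stddef.h>

/* Round offset up to the next multiple of alignment.
   alignment must be a non-zero power of two. */
static size_t align_up(size_t offset, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (offset + alignment - 1) & ~(alignment - 1);
}
```

For example, if 1001 ushort values (2002 bytes) came first in the block, a following double array would have to start at align_up(2002, 8), which is byte 2008. Putting the largest type first, as described above, avoids the padding.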
Hello, I am fairly new to OpenCL and have encountered a problem when trying to index my multidimensional arrays. From what I understand, it is not possible to store a multidimensional array in global memory, but it is possible in local memory. However, when I try to access my 2D local array, it always comes back as 0. I had a look at my GPU at http://www.notebookcheck.net/NVIDIA-GeForce-GT-635M.66964.0.html and found out that I have 0 shared memory. Could this be the reason? What other limitations will 0 shared memory place on my programming experience?
I've posted a small simple program of the problem that I'm facing.
The input is [1,2,3,4] and I would like to store this in my 2D array.
__kernel void kernel(__global float *input, __global float *output)
{//the input is [1,2,3,4];
    int size=2;//2by2 matrix
    int idx = get_global_id(0);
    int idy = get_global_id(1);
    __local float 2Darray[2][2];
    2Darray[idx][idy]=input[idx*size+idy];
    output[0]=2Darray[1][1];//this always returns 0, but should return 4 on the first output no?
}
__local float 2Darray[1][1];
is 1 element wide and 1 element high.
2Darray[1][1]
is the second row and second column, which doesn't exist.
Even if it lets you declare local memory without an error, an array that doesn't fit in the local memory space spills to global memory and becomes as slow as VRAM bandwidth allows.
Race condition:
output[0]=2Darray[1][1];
Every work-item tries to write to the same index (0) of output. Add
barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);
if(idx==0 && idy==0)
before it, so that only one work-item writes to it. The barrier has to come before the if, because the write to 2Darray[1][1] is done by a different work-item and must be complete before the value is read.
What is the difference/use for these 2 types? I have a basic understanding regarding pointers but I just can't wrap my head around this.
uint8_t* address_at_eeprom_location = (uint8_t*)10;
This line found in an Arduino example makes me feel so dumb. :)
So basically this is a double pointer?
uint8_t is an unsigned 8-bit integer; this is the data stored directly in memory. uint8_t * is a pointer to the memory in which such a number is stored. (uint8_t*) is a cast of the literal 10 to the pointer type. No storage is created for the 10: the value 10 itself becomes the address that is stored in the address_at_eeprom_location variable.
uint8_t is an unsigned 8-bit integer
uint8_t* is a pointer to an 8-bit integer in RAM
(uint8_t*)10 is a pointer to a uint8_t at address 10 in RAM
So basically this line saves the address of a uint8_t location in address_at_eeprom_location by setting it to 10. Most likely this address is used later in the code to write or read an actual uint8_t value at that location.
Instead of pointing to a single value, it can also be used later in the code as the starting point of an array:
uint8_t x = address_at_eeprom_location[3];
This would read the uint8_t at index 3 (the fourth element) starting from address 10, so the one at address 13, into the variable x.
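The indexing rule can be tried out safely on a desktop machine by replacing the fixed hardware address with an ordinary array. Here is a minimal sketch in plain C. The names fake_memory and read_at are made up for illustration, and the array only stands in for the device memory that address 10 would refer to on real hardware:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a small byte-addressable memory. */
static uint8_t fake_memory[32];

/* Read the byte at base_address + index, which is what base[index]
   does for a uint8_t pointer, since each element is one byte wide. */
static uint8_t read_at(uint8_t base_address, uint8_t index)
{
    assert((unsigned)base_address + index < sizeof fake_memory);
    const uint8_t *base = &fake_memory[base_address];
    return base[index];
}
```

With this stand-in, read_at(10, 3) returns whatever is stored in fake_memory[13], matching the address arithmetic described above.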