First of all... I am no expert in OpenCL.
I am using 2 kernels. The output of the first kernel is an image2d_t, but the input of the second kernel is __global const uchar* source.
__kernel void firstKernel(__read_only image2d_t input, __write_only image2d_t output)
{...}
__kernel void secondKernel( __global const uchar* source,...)
{...}
How can I use the first kernel's output as the second kernel's input?
I think you should be able to do it using the cl_khr_image2d_from_buffer extension assuming your implementation supports it.
My understanding is that you first create the buffer (bearing in mind the alignment and pitch requirements of images), then create the image from the buffer using the extension's functionality. The usual rules about read/write access and synchronisation still apply.
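As a rough sketch (assuming the device reports cl_khr_image2d_from_buffer, and with width, height, row_pitch, ctx, and buf as placeholders for your own values), the host-side creation could look like this; the pitch must satisfy the device's CL_DEVICE_IMAGE_PITCH_ALIGNMENT:

```c
/* Sketch: create an image2d_t that aliases an existing cl_mem buffer,
 * so firstKernel can write it as an image while secondKernel reads
 * the same memory as a __global uchar* buffer. */
cl_image_format fmt = { CL_R, CL_UNSIGNED_INT8 };  /* matches uchar data */
cl_image_desc desc;
memset(&desc, 0, sizeof(desc));
desc.image_type      = CL_MEM_OBJECT_IMAGE2D;
desc.image_width     = width;
desc.image_height    = height;
desc.image_row_pitch = row_pitch;  /* must meet the device's pitch alignment */
desc.buffer          = buf;        /* the cl_mem buffer both kernels share */

cl_int err;
cl_mem img = clCreateImage(ctx, CL_MEM_READ_WRITE, &fmt, &desc, NULL, &err);
```

You would then pass img to firstKernel and buf to secondKernel, with appropriate event dependencies between the two enqueues.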
Reading the OpenCL documentation, I know that the access qualifiers read_only and write_only are intended for image memory.
However, I'm noticing some people use these qualifiers on regular, non-image-memory, parameters, e.g.:
void foo(unsigned n, __global read_only int* data)
Note the lack of const.
My questions:
Does read_only imply, in particular, const?
Does read_only imply anything other than const? Something else that an OpenCL compiler can utilize?
... or is it just meaningless for non-image-memory, and ignored?
I wrote a function in OpenCL:
void sort(int* array, int size)
{
}
and I need to call the function once on a __private array and once on a __global array. Apparently, OpenCL does not allow specifying multiple address spaces for a type, so I have to duplicate the function declaration even though the bodies are identical:
void sort_g(__global int* array, int size)
{
}
void sort_p(__private int* array, int size)
{
}
This makes the code very hard to maintain, and I am wondering whether there is a better way to manage multiple address spaces in OpenCL.
P.S.: I don't see why OpenCL doesn't allow multiple address spaces for a type. The compiler could generate one instance of the function per address space and pick the appropriate one wherever the function is called in a kernel.
For OpenCL < 2.0, this is how the language is designed and there is no getting around it, regrettably.
For OpenCL >= 2.0, with the introduction of the generic address space, your first piece of code works as you would expect.
In short, upgrading to 2.0 would solve your problem (and bring in other niceties), otherwise you're out of luck (you could perhaps wrap your function in a macro, but ew, macros).
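For completeness, the macro workaround mentioned above could look something like this on OpenCL < 2.0 (a sketch; the generated names sort_g and sort_p mirror the question, and the sort body shown is just a simple selection-style sort for illustration):

```c
/* Write the body once; stamp out one copy per address space. */
#define DEFINE_SORT(NAME, AS)                      \
    void NAME(AS int* array, int size)             \
    {                                              \
        for (int i = 0; i < size; ++i) {           \
            for (int j = i + 1; j < size; ++j) {   \
                if (array[j] < array[i]) {         \
                    int t = array[i];              \
                    array[i] = array[j];           \
                    array[j] = t;                  \
                }                                  \
            }                                      \
        }                                          \
    }

DEFINE_SORT(sort_g, __global)
DEFINE_SORT(sort_p, __private)
```

The duplication still happens, but only the preprocessor sees it, so there is a single body to maintain.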
I am running multiple iterations of an OpenCL program, and after a few, I get the following error:
ERROR: Read Result (-5)
CL_OUT_OF_RESOURCES
when running this command
err = clEnqueueReadBuffer( commands, d_c, CL_TRUE, 0,
sizeof(char) * result_size,
result_buffer, 0, NULL, NULL );
checkErr(err,"Read Result");
The kernel allocates 3 global memory buffers, which I release
clReleaseMemObject(d_a);
clReleaseMemObject(d_b);
clReleaseMemObject(d_c);
clReleaseKernel(ko_smat);
But I also allocate local and private memory: the private memory is allocated inside the kernel (char tmp_array), and the local memory is passed in as a kernel argument.
My kernel has definition:
__kernel void mmul(
__global char* C,
__global char* A,
__global char* B,
const int rA,
const int rB,
const int cC,
__local char* local_mem)
The local memory is allocated by passing a NULL pointer to clSetKernelArg:
clSetKernelArg(ko_smat,6, sizeof(char) * local_mem_size, NULL);
I'm guessing that the out-of-resources error is caused by my failing to free either the private memory or the local memory, but I don't know how.
Since I don't have enough reputation to comment, I have to use an answer.
To properly address your problem it would be helpful if you posted a working example of your code.
How much local memory do you actually allocate? It may very well be that you allocate more than your device supports. If your local_mem_size variable is not fixed but calculated dynamically, work out the worst-case value.
You can query how much local memory your device can provide, just call clGetDeviceInfo with CL_DEVICE_LOCAL_MEM_SIZE.
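The query is a one-liner (a sketch; device is assumed to be your cl_device_id):

```c
/* CL_DEVICE_LOCAL_MEM_SIZE reports the local memory available
 * per compute unit; your per-work-group allocation must fit in it. */
cl_ulong local_mem_size = 0;
clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                sizeof(local_mem_size), &local_mem_size, NULL);
printf("Local memory: %llu bytes\n", (unsigned long long)local_mem_size);
```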
As DarkZeros already mentioned, CL_OUT_OF_RESOURCES is an error that occurs on NVIDIA GPUs when addressing memory out of range. This can happen for both local and global memory.
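One common cause of such out-of-range accesses is a global work size rounded up past the data size. A hedged sketch of a guard at the top of the kernel (the rA * cC element count is an assumption; adjust it to your actual indexing):

```c
__kernel void mmul(__global char* C, __global char* A, __global char* B,
                   const int rA, const int rB, const int cC,
                   __local char* local_mem)
{
    int gid = get_global_id(0);
    if (gid >= rA * cC)  /* assumed total output element count */
        return;          /* skip padded work-items so they never index
                            past the ends of A, B, or C */
    /* ... rest of the kernel ... */
}
```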
This question has been on my mind for many days and I still haven't been able to clear it up.
If I want to send a structure through a signal, say
struct radarStruct
{
unsigned char *Data;
unsigned int rate;
unsigned int timeStamp;
float timeSec;
};
should I declare the signal like
signals:
void radarGeneratorData(const radarStruct &);
or
signals:
void radarGeneratorData(radarStruct *);
Also, what about the unsigned char *Data member? Will the signal make a deep copy of it?
Similarly, how can I send a plain unsigned char *data through a signal?
Please help me clear this up: what is the best way to send a structure through the signal/slot mechanism?
Thanks in advance.
This mainly depends on your requirement.
If you want to send the structure over to another object and "share" it, so that changes are reflected at the source, you would need to send it as a pointer. If not, you would want to send it as a const reference.
For either method, remember that you need to Q_DECLARE_METATYPE(YourStructType) before you can use the structure in signal/slot arguments.
As for when a deep copy occurs, and when the emission behaves like a direct function call, you can have a read through
Signaling failure in qt slots
Same-thread and cross-thread communication behave differently, and the outcome again depends on your usage.
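As for the deep-copy question specifically: copying the struct by value (which a queued connection does once the type is registered) copies the raw Data pointer, not the bytes it points to. This can be demonstrated outside Qt in plain C++ (the struct mirrors the question; copy_is_shallow is a hypothetical helper name):

```cpp
#include <cassert>

// Mirror of the struct from the question.
struct radarStruct {
    unsigned char *Data;
    unsigned int rate;
    unsigned int timeStamp;
    float timeSec;
};

// Returns true when a by-value copy still aliases the original buffer,
// i.e. the copy is shallow with respect to Data.
bool copy_is_shallow(const radarStruct &s) {
    radarStruct copy = s;        // plain member-wise copy
    return copy.Data == s.Data;  // same pointer => no deep copy happened
}
```

So if the receiving slot may outlive the buffer, either deep-copy Data yourself before emitting, or replace the raw pointer with a QByteArray member, which is implicitly shared and safe to copy across threads.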
I am developing a Windows 64-bit application that will manage concurrent execution of different CUDA-algorithms on several GPUs.
My design requires a way of passing pointers to device memory around C++ code (e.g. storing them as members in my C++ objects).
I know that it is impossible to declare class members with __device__ qualifiers.
However I couldn't find a definite answer whether assigning __device__ pointer to a normal C pointer and then using the latter works. In other words: Is the following code valid?
__device__ float *ptr;
cudaMalloc(&ptr, size);
float *ptr2 = ptr;
some_kernel<<<1,1>>>(ptr2);
For me it compiled and behaved correctly but I would like to know whether it is guaranteed to be correct.
No, that code isn't strictly valid. While it might work on the host side (more or less by accident), if you tried to dereference ptr directly from device code, you would find it would have an invalid value.
The correct way to do what your code implies would be like this:
__device__ float *ptr;
__global__ void some_kernel()
{
float val = ptr[threadIdx.x];
....
}
float *ptr2;
cudaMalloc(&ptr2, size);
cudaMemcpyToSymbol("ptr", ptr2, sizeof(float *));
some_kernel<<<1,1>>>();
For CUDA 4.x or newer, change the cudaMemcpyToSymbol call to:
cudaMemcpyToSymbol(ptr, ptr2, sizeof(float *));
If the static device symbol ptr is really superfluous, you can just do something like this:
float *ptr2;
cudaMalloc(&ptr2, size);
some_kernel<<<1,1>>>(ptr2);
But I suspect that what you are probably looking for is something like the Thrust library's device_ptr class, which is a nice abstraction wrapping the raw device pointer and makes it absolutely clear in code what is in device memory and what is in host memory.
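A minimal sketch of that approach (assuming CUDA with Thrust available; some_kernel and n stand in for your own kernel and element count):

```cpp
#include <thrust/device_ptr.h>
#include <thrust/device_malloc.h>
#include <thrust/device_free.h>

// The typed wrapper can be stored as a member of a host-side C++ object,
// making it explicit that the memory lives on the device.
thrust::device_ptr<float> dptr = thrust::device_malloc<float>(n);

// The raw pointer is recovered only at the kernel-launch boundary.
some_kernel<<<1, 1>>>(thrust::raw_pointer_cast(dptr));

thrust::device_free(dptr);
```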