CUDA 4.0: error using pointers within kernels

My question is as follows:
I wish to use a kernel in two ways.
Either I use an array d_array that has been copied to the device using cudaMemcpy, i.e. through
cutilSafeCall(cudaMemcpy(d_array, array, 100*sizeof(double),
cudaMemcpyHostToDevice));
Or I pass a double mydouble directly, i.e. double mydouble = 3;
If I pass the array, I simply use (which works fine):
kernel<<<1, 100>>>(d_array, 100, output);
If I pass a double, I use (which doesn't work!):
kernel<<<1, 100>>>(&mydouble, 1, output);
My kernel is listed below:
__global__ void kernel(double * d_array, int size_d_array, double * output)
{
    if (size_d_array == 100)
        { output[threadIdx.x] = d_array[threadIdx.x]; }
    else
        { output[threadIdx.x] = d_array[0]; }
}

double aDouble = 3;
double *myDouble = &aDouble;
If you do the above in host code, then myDouble is a pointer to host memory. That is why you can't pass it directly to a device kernel (a pointer is a pointer, whether it points to an array or a scalar value!).
However, in CUDA 4.0 you can call cudaHostRegister on the host pointer, and if your system supports unified virtual addressing, you can then pass it to the kernel directly. If it does not, you can call cudaHostRegister with the appropriate flags and then cudaHostGetDevicePointer to obtain a pointer you can pass to the device kernel. See the CUDA documentation on cudaHostRegister and cudaHostGetDevicePointer for details.
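To make that concrete, here is a minimal sketch of the mapped-pointer route. It is not from the original answer: the cudaDeviceMapHost flag is an assumption about the setup, some CUDA versions expect the registered range to be page-aligned, and output is assumed to be the device buffer from the question.

double mydouble = 3;
double *d_mydouble = NULL;

// Enable mapped pinned memory before any other CUDA runtime call (needed on pre-UVA setups).
cudaSetDeviceFlags(cudaDeviceMapHost);

// Page-lock the host variable so the device can map it...
cudaHostRegister(&mydouble, sizeof(double), cudaHostRegisterMapped);
// ...and obtain a pointer the kernel can actually dereference.
cudaHostGetDevicePointer((void **)&d_mydouble, &mydouble, 0);

kernel<<<1, 100>>>(d_mydouble, 1, output);
cudaDeviceSynchronize();

cudaHostUnregister(&mydouble);

The simpler alternative is, of course, to cudaMalloc a one-element device buffer and cudaMemcpy the scalar into it, exactly as is done for the array case.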

Related

declaring and defining pointer vectors of vectors in OpenCL Kernel

I have a variable which is a vector of vectors. In C++ I can easily declare and define it, but in an OpenCL kernel I am facing issues. Here is an example of what I am trying to do:
std::vector<std::vector<double>> filters;
for (int m = 0; m < 3; m++)
{
    const auto& w = filters[m];
    // ... sum operation using w
}
Here I can easily reference the values of filters[m] through w, but I am not able to do this in the OpenCL kernel file. Here is what I have tried, but it is giving me the wrong output.
In host code:
filter_dev = cl::Buffer(context,CL_MEM_READ_ONLY|CL_MEM_USE_HOST_PTR,filter_size,(void*)&filters,&err);
filter_dev_buff = cl::Buffer(context,CL_MEM_READ_WRITE,filter_size,NULL,&err);
kernel.setArg(0, filter_dev);
kernel.setArg(1, filter_dev_buff);
In kernel code:
__kernel void forward_shrink(__global double* filters, __global double* weight)
{
    int i = get_global_id(0); // I have tried using individual values of i to index filters, just to check the output, but it does not give the same values as the serial C++ implementation
    weight = &filters[i];
    // ... sum operations using weight
}
Can anyone help me? Where am I going wrong, or what could be the solution?
You are doing multiple things wrong with your vectors.
First of all (void*)&filters doesn't do what you want it to do. &filters doesn't return a pointer to the beginning of the actual data. For that you'll have to use filters.data().
Second, you can't use an array of arrays in OpenCL (let alone a vector of vectors). You'll have to flatten the data yourself into a 1D array before you pass it to an OpenCL kernel.
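For illustration, a minimal host-side sketch of the flattening step (the names are assumptions, not from the original answer; it presumes every inner vector has the same length, so the kernel can index row m, column k as m * row_len + k):

#include <vector>

std::vector<double> flatten(const std::vector<std::vector<double>>& filters)
{
    std::vector<double> flat;
    for (const auto& row : filters)
        flat.insert(flat.end(), row.begin(), row.end()); // copy the rows back to back
    return flat;
}

// Then create the buffer from the contiguous data, not from &filters:
// std::vector<double> flat = flatten(filters);
// filter_dev = cl::Buffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
//                         flat.size() * sizeof(double), flat.data(), &err);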

When I invoke an asynchronous CUDA kernel, how are its arguments copied?

Say I want to invoke a CUDA kernel, like this:
struct foo { int a; int b; float c; double d; };
foo arg;
// fill in elements of `arg` here
my_kernel<<<grid_size, block_size, 0, stream>>>(arg);
Assume that stream was previously created using a call to cudaStreamCreate(), so the above will execute asynchronously. I'm concerned about the required lifetime of arg.
Are the arguments to the kernel copied synchronously when I invoke it (so it would be safe for arg to go out of scope immediately), or are they copied asynchronously (so I need to ensure that it stays alive until the kernel runs)?
Arguments are copied synchronously at launch. The API exposes a call stack onto which execution parameters and function arguments are pushed in order; a call then finalises those arguments into a CUDA kernel launch on the driver's internal streams/command queues.
This process isn't documented, but as of CUDA 7.5, a runtime API kernel launch like this:
dot_product<<<1,n>>>(n, d_a, d_b);
becomes this:
(cudaConfigureCall(1, n)) ? (void)0 : (dot_product)(n, d_a, d_b);
where the host stub function dot_product is expanded into this:
void __device_stub__Z11dot_productiPfS_(int __par0, float *__par1, float *__par2)
{
    if (cudaSetupArgument((void *)(char *)&__par0, sizeof(__par0), (size_t)0UL) != cudaSuccess) return;
    if (cudaSetupArgument((void *)(char *)&__par1, sizeof(__par1), (size_t)8UL) != cudaSuccess) return;
    if (cudaSetupArgument((void *)(char *)&__par2, sizeof(__par2), (size_t)16UL) != cudaSuccess) return;
    {
        volatile static char *__f __attribute__((unused));
        __f = ((char *)((void (*)(int, float *, float *))dot_product));
        (void)cudaLaunch(((char *)((void (*)(int, float *, float *))dot_product)));
    };
}

void dot_product(int __cuda_0, float *__cuda_1, float *__cuda_2)
{
    __device_stub__Z11dot_productiPfS_(__cuda_0, __cuda_1, __cuda_2);
}
cudaSetupArgument is the API call which pushes arguments onto the call stack. Interestingly, it is actually deprecated in the API documentation for CUDA 7.5, even though the compiler still uses it. I would therefore expect this to change in the future, but the idea will be the same.
The parameters of the kernel call are copied prior to execution, so their scope should be of no concern. But note that the combined size of all kernel parameters cannot exceed a fixed limit (4 KB on devices of compute capability 2.x and later). If you want to pass larger structs or blobs of data, you need to allocate memory on the device using cudaMalloc, copy the contents of the host struct to the device struct using cudaMemcpy, and call the kernel with a pointer to the new device struct.
Your code would look something like this:
struct foo { int a; int b; float c; double d; };
foo arg;
foo *arg_d;
// fill in elements of `arg` here
cudaMalloc(&arg_d, sizeof(foo));
// check the allocation here
cudaMemcpy(arg_d, &arg, sizeof(foo), cudaMemcpyHostToDevice);
my_kernel<<<grid_size, block_size, 0, stream>>>(arg_d);

OpenCL kernel arguments ambiguity

I have two functions:
void sum1(short * a, short * b, short * res, int size);
void sum2(float * a, float * b, float * res, int size);
and I have a single generic kernel
__kernel void sum(__global const T * a, __global const T * b, __global T * res, int size)
{
int x = get_global_id(0);
if (x < size) res[x] = a[x] + b[x];
}
Is it safe to invoke this generic kernel from the functions presented above with the compile options -D T=short and -D T=float respectively? Do I need to take care of alignment, or does OpenCL automatically align kernel arguments to 2 and 4 bytes respectively in this case?
In general, when I pass a cl_mem object to a kernel, OpenCL does not know the data type stored in that cl_mem object, and I can't understand how OpenCL "transforms" the cl_mem object into the appropriate pointer in the kernel argument. I need help with that.
1. Yes, it is safe to use -D T=short or -D T=float directly at compile time, since that generates two proper kernels.
2. OpenCL (like other programming languages with pointers) understands that the pointer you pass has a type, and it sticks to that type when addressing memory.
At least in C this is not a problem, since implicit conversion between incompatible pointer types is not allowed, and the programmer gets an error if the pointer types don't match.
In OpenCL, however, the memory of a buffer is treated as a generic (void) pointer. When you set it as a kernel argument, the cast to the kernel's pointer type is implicit, but that doesn't mean it is correct!
For example, if you create a float buffer, fill it with floats and pass it to the short kernel, the result will be wrong, because the kernel misinterprets the buffer. And if you pass a short buffer to the float kernel, the kernel will read past the end of the buffer, likely causing a segmentation fault.
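As a sketch of the first point, the same source can simply be built twice with different options. This is an illustration, not code from the original question: ctx, device, err and src (a const char * holding the generic sum kernel source) are assumed to be set up elsewhere.

// Build one program per element type from the same source string.
cl_program prog_short = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
clBuildProgram(prog_short, 1, &device, "-D T=short", NULL, NULL);
cl_kernel sum_short = clCreateKernel(prog_short, "sum", &err);

cl_program prog_float = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
clBuildProgram(prog_float, 1, &device, "-D T=float", NULL, NULL);
cl_kernel sum_float = clCreateKernel(prog_float, "sum", &err);

Each resulting kernel then expects buffers whose contents match its own element type, which is exactly why mixing them up produces the misinterpretation described above.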

CUDA streams, texture binding and async memcpy

Writing some signal processing in CUDA I recently made huge progress in optimizing it. By using 1D textures and adjusting my access patterns I managed to get a 10× performance boost. (I previously tried transaction aligned prefetching from global into shared memory, but the nonuniform access patterns happening later messed up the warp→shared cache bank association (I think)).
So now I'm facing the problem, how CUDA textures and bindings interact with asynchronous memcpy.
Consider the following kernel
texture<...> mytexture;
__global__ void mykernel(float *pOut)
{
    pOut[threadIdx.x] = tex1Dfetch(mytexture, threadIdx.x);
}
The kernel is launched in multiple streams
extern void *sourcedata;
#define N_CUDA_STREAMS ...
cudaStream_t stream[N_CUDA_STREAMS];
void *d_pOut[N_CUDA_STREAMS];
void *d_texData[N_CUDA_STREAMS];
for(int k_stream = 0; k_stream < N_CUDA_STREAMS; k_stream++) {
    cudaStreamCreate(&stream[k_stream]);
    cudaMalloc(&d_pOut[k_stream], ...);
    cudaMalloc(&d_texData[k_stream], ...);
}
/* ... */
for(int i_datablock = 0; i_datablock < n_datablocks; i_datablock++) {
    int const k_stream = i_datablock % N_CUDA_STREAMS;
    cudaMemcpyAsync(d_texData[k_stream], (char*)sourcedata + i_datablock * blocksize, ..., stream[k_stream]);
    cudaBindTexture(0, &mytexture, d_texData[k_stream], ...);
    mykernel<<<..., stream[k_stream]>>>(d_pOut[k_stream]);
}
Now what I wonder about is: since there is only one texture reference, what happens when I bind a buffer to the texture while kernels in other streams are accessing it? cudaBindTexture doesn't take a stream parameter, so I'm worried that by binding the texture to another device pointer while asynchronously running kernels are accessing said texture, I'll divert their accesses to the other data.
The CUDA documentation doesn't say anything about this. If I have to disentangle this to allow concurrent access, it seems I'd have to create a number of texture references and use a switch statement to choose between them, based on the stream number passed as a kernel launch parameter.
Unfortunately CUDA doesn't allow putting arrays of texture references on the device side, i.e. the following does not work:
texture<...> texarray[N_CUDA_STREAMS];
Layered textures are not an option, because the amount of data I have only fits within a plain 1D texture not bound to a CUDA array (see table F-2 in the CUDA 4.2 C Programming Guide).
Indeed you cannot unbind the texture while still using it in a different stream.
Since the number of streams doesn't need to be large to hide the asynchronous memcpys (2 would already do), you could use C++ templates to give each stream its own texture:
texture<float, 1, cudaReadModeElementType> mytexture1;
texture<float, 1, cudaReadModeElementType> mytexture2;
template<int TexSel> __device__ float myTex1Dfetch(int x);
template<> __device__ float myTex1Dfetch<1>(int x) { return tex1Dfetch(mytexture1, x); }
template<> __device__ float myTex1Dfetch<2>(int x) { return tex1Dfetch(mytexture2, x); }
template<int TexSel> __global__ void mykernel(float *pOut)
{
pOut[threadIdx.x] = myTex1Dfetch<TexSel>(threadIdx.x);
}
int main(void)
{
float *out_d[2];
// ...
mykernel<1><<<blocks, threads, 0, stream[0]>>>(out_d[0]);
mykernel<2><<<blocks, threads, 0, stream[1]>>>(out_d[1]);
// ...
}
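To complete the picture, here is a sketch of the host side under this scheme, with names taken from the question and the answer (sourcedata, blocksize, n_datablocks, stream, d_texData, out_d, blocks, threads); the exact sizes are assumptions. Each texture reference is bound once to its own per-stream buffer, so no rebinding happens while kernels in the other stream are in flight.

cudaBindTexture(0, mytexture1, d_texData[0], blocksize);
cudaBindTexture(0, mytexture2, d_texData[1], blocksize);

for (int i_datablock = 0; i_datablock < n_datablocks; i_datablock++) {
    int const k = i_datablock % 2;
    // Copy the next block into this stream's buffer, then launch the kernel
    // that fetches from the texture bound to that buffer.
    cudaMemcpyAsync(d_texData[k], (char *)sourcedata + i_datablock * blocksize,
                    blocksize, cudaMemcpyHostToDevice, stream[k]);
    if (k == 0)
        mykernel<1><<<blocks, threads, 0, stream[0]>>>(out_d[0]);
    else
        mykernel<2><<<blocks, threads, 0, stream[1]>>>(out_d[1]);
}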

CUDA device pointer manipulation

I've used:
float *devptr;
//...
cudaMalloc(&devptr, sizeofarray);
cudaMemcpy(devptr, hostptr, sizeofarray, cudaMemcpyHostToDevice);
in CUDA C to allocate and populate an array.
Now I'm trying to run a cuda kernel, e.g.:
__global__ void kernelname(float *ptr)
{
//...
}
in that array but with an offset value.
In C/C++ it would be something like this:
kernelname<<<dimGrid, dimBlock>>>(devptr+offset);
However, this doesn't seem to work.
Is there a way to do this without sending the offset value to the kernel as a separate argument and using that offset in the kernel code?
Any ideas on how to do this?
Pointer arithmetic does work just fine in CUDA. You can add an offset to a CUDA pointer in host code and it will work correctly (remembering the offset isn't a byte offset, it is a plain word or element offset).
EDIT: A simple working example:
#include <cstdio>

int main(void)
{
    const int na = 5, nb = 4;
    float a[na] = { 1.2, 3.4, 5.6, 7.8, 9.0 };
    float *_a, b[nb];
    size_t sza = size_t(na) * sizeof(float);
    size_t szb = size_t(nb) * sizeof(float);

    cudaFree(0);
    cudaMalloc((void **)&_a, sza);
    cudaMemcpy(_a, a, sza, cudaMemcpyHostToDevice);
    cudaMemcpy(b, _a+1, szb, cudaMemcpyDeviceToHost);

    for(int i=0; i<nb; i++)
        printf("%d %f\n", i, b[i]);

    cudaThreadExit();
}
Here, you can see a word/element offset has been applied to the device pointer in the second cudaMemcpy call to start the copy from the second word, not the first.
Pointer arithmetic does work in host-side code; it's used fairly often in the example code provided by NVIDIA.
"Linear memory exists on the device in a 40-bit address space, so separately allocated entities can reference one another via pointers, for example, in a binary tree."
Read more at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
And from the NVIDIA Performance Primitives (NPP) documentation, a perfect example of pointer arithmetic:
"4.5.1 Select-Channel Source-Image Pointer
This is a pointer to the channel-of-interest within the first pixel of the source image. E.g. if pSrc is the
pointer to the first pixel inside the ROI of a three channel image. Using the appropriate select-channel copy
primitive one could copy the second channel of this source image into the first channel of a destination
image given by pDst by offsetting the pointer by one:
nppiCopy_8u_C3CR(pSrc + 1, nSrcStep, pDst, nDstStep, oSizeROI);"
*Note: this works without multiplying by the number of bytes per data element because the compiler is aware of the data type of the pointer, and calculates the address accordingly.
In C and C++, pointer arithmetic can be accomplished as above or with the notation &ptr[offset] (which yields the device memory address of the data rather than its value; reading the value itself will not work on device memory from host-side code). With either notation the size of the data type is handled automatically, and the offset is specified as a number of data elements rather than bytes.
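Applied back to the original question, both notations look like this. This is a minimal sketch: dimGrid, dimBlock, devptr and offset are the question's own names, and the body of the kernel is an assumption purely for illustration.

__global__ void kernelname(float *ptr)
{
    // ptr already points at devptr + offset, so thread i touches devptr[offset + i]
    ptr[threadIdx.x] *= 2.0f;
}

// Host side: the compiler scales the offset by sizeof(float) in both forms.
kernelname<<<dimGrid, dimBlock>>>(devptr + offset);
kernelname<<<dimGrid, dimBlock>>>(&devptr[offset]);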

Resources