Can I know OpenCL's function call stack size?
I'm using NVIDIA OpenCL 1.2 on Ubuntu (NVIDIA compute capability 5.2), and I found an unexpected result in my test code.
Once a function has been invoked 64 times, the next invocation seems unable to access its arguments.
My guess is that a call stack overflow causes this problem.
Below is my example code and result:
void testfunc(int count, int run)
{
if(run==0) return;
count++;
printf("count=%d run=%d\n", count, run);
run = run - 1;
testfunc(count, run);
}
__kernel void hello(__global int * in_img, __global int * out_img)
{
int run;
int count=0;
run = 70;
testfunc(count, run);
}
And this is the result :
count=1 run=70
count=2 run=69
count=3 run=68
count=4 run=67
count=5 run=66
count=6 run=65
count=7 run=64
.....
count=59 run=12
count=60 run=11
count=61 run=10
count=62 run=9
count=63 run=8
count=64 run=7
count=0 run=0 // <--- Why are count and run ZERO?
count=0 run=0
count=0 run=0
count=0 run=0
count=0 run=0
count=0 run=0
Recursion is not supported in OpenCL 1.x. From AMD's Introduction to OpenCL:
Key restrictions in the OpenCL C language are:
Function pointers are not supported.
Bit-fields are not supported.
Variable length arrays are not supported.
Recursion is not supported.
No C99 standard headers such as ctype.h, errno.h, stdlib.h, etc. can be included.
AFAIK not all implementations have a call-stack-like feature at all. In those cases, and possibly in your case, any function calls are inlined into the calling scope.
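Since the language does not guarantee a call stack, the usual workaround is to express the recursion as iteration. Here is a minimal sketch of the test function from the question rewritten as a loop (testfunc_iter is just an illustrative name); this is valid OpenCL 1.x C:

void testfunc_iter(int count, int run)
{
    // same countdown as the recursive version, but expressed as a loop
    while (run != 0)
    {
        count++;
        printf("count=%d run=%d\n", count, run);
        run = run - 1;
    }
}

__kernel void hello(__global int * in_img, __global int * out_img)
{
    testfunc_iter(0, 70);
}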
Related
Good day, everyone!
I would like to ask the community's advice about using GPU computing power instead of, or together with, the CPU.
I have a well-functioning program based on a recursive search over all possible combinations of certain events, parallelized with OpenMP to run on all available processor cores.
The C++ pseudocode is as follows:
// #includes
// function declarations
// declaring a global variable:
QVector<QVector<QVector<float>>> variant; // (or "std::vector")
int main() {
// reads data from file
// data are converted and analyzed
// the variant variable containing the current best result is filled in (here - by pre-analysis)
#pragma omp parallel shared(variant)
#pragma omp master
// call the recursive algorithm that searches all variants:
PEREBOR(Tabl_1, a, i_a, ..., rec_depth);
return 0;
}
void PEREBOR(QVector<QVector<uint8_t>> Tabl_1, QVector<A_struct> a, uint8_t i_a, ..., uint8_t rec_depth)
{
// determine the bounds of the first loop, for various reasons
for (int i = quantity; i < another_quantity; i++) {
// Tabl_1 is processed and modified to determine the number of iterations of the following for loop
for (int k = 0; k < the_quantity_just_found; k++) {
if the recursion depth is not 1, descend further: {
// spawn the descent to the next recursion level as an OpenMP task:
#pragma omp task
PEREBOR(Tabl_1_COPY, a, i_a, ..., rec_depth-1);
}
else (if we have reached the lowest level): {
if (condition fulfilled) // condition check - READS the variant variable
variant = it_is_equal_to_that_,_to_that...;
else
continue;
}
}
}
}
Unfortunately, I don't have a CPU with a thousand cores at my disposal, and without that the algorithm runs for a very long time. At my workplace I was advised to consider using a GPU to speed up the calculations. I have learned that OpenMP can offload work to video cards (NVIDIA ones in particular), and that OpenACC handles this well too.
In this regard, my main question is whether it is possible to run a recursive algorithm on a GPU simply and, at the same time, efficiently. Can this give a noticeable speedup relative to the CPU? If so, would OpenACC perhaps do a better job? Is it possible to hand work to the video card through "#pragma omp task", or are other directives REQUIRED? And how could computations on the CPU and the GPU be combined?
Thank you so much for any help!
P.S. I apologize for my English, which is not my native language :)
I have the following loop executed in my OpenCL kernel:
__kernel void kernelA(/* many parameters */)
{
/* Prefetching code and other stuff
* ...
* ...
*/
float2 valueA = 0.0f;
#pragma unroll //<----- line X
for(unsigned int i = 0; i < MAX_A; i++) // MAX_A > 0
{
#pragma unroll
for(unsigned int j = 0; j < MAX_B; j++) // MAX_B > 0
valueA += arrayA[(i * MAX_A) + j];
}
/*
* Code that uses the result saved to valueA
*/
}
As can clearly be seen, the loop is supposed to sum up the values contained in arrayA. Now I wanted to try #pragma unroll to see whether there is any performance difference between looped and unrolled execution.
But when I compile the kernel, the compiler notes LOOP UNROLL: pragma unroll (line X) ignored because this loop is dead and deleted. I don't understand that message, because the code in the loop is surely executed: MAX_A and MAX_B are definitely greater than zero, and the sum saved to valueA is also used after the loop.
I have the same structure elsewhere in the code, and that position is flagged with the same note.
The compiler I use is the AMD OpenCL C compiler delivered by the APP SDK.
The comment by @DarkZeroes is the solution to this question. There was no instruction that stored the result in an output buffer of the kernel, so the code above and everything that depended on it was optimized away by the compiler.
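For reference, here is a minimal sketch of the fix, assuming MAX_A and MAX_B are compile-time constants and that arrayA and an output buffer (called out_img here purely for illustration) are kernel parameters; writing valueA to global memory makes the result observable, so the loops are no longer dead:

__kernel void kernelA(__global const float2 * arrayA,
                      __global float2 * out_img /* illustrative output buffer */)
{
    float2 valueA = (float2)(0.0f, 0.0f);

    #pragma unroll
    for(unsigned int i = 0; i < MAX_A; i++)
    {
        #pragma unroll
        for(unsigned int j = 0; j < MAX_B; j++)
            valueA += arrayA[(i * MAX_A) + j];
    }

    // Storing the sum keeps the loops (and valueA) alive.
    out_img[get_global_id(0)] = valueA;
}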
Say I want to invoke a CUDA kernel, like this:
struct foo { int a; int b; float c; double d; };
foo arg;
// fill in elements of `arg` here
my_kernel<<<grid_size, block_size, 0, stream>>>(arg);
Assume that stream was previously created using a call to cudaStreamCreate(), so the above will execute asynchronously. I'm concerned about the required lifetime of arg.
Are the arguments to the kernel copied synchronously when I invoke it (so it would be safe for arg to go out of scope immediately), or are they copied asynchronously (so I need to ensure that it stays alive until the kernel runs)?
Arguments are copied synchronously at launch. The API exposes a call stack onto which execution parameters and function arguments are pushed in order; a call then finalises those arguments into a CUDA kernel launch on the driver's internal streams/command queues.
This process isn't documented, but as of CUDA 7.5, a runtime API kernel launch like this:
dot_product<<<1,n>>>(n, d_a, d_b);
becomes this:
(cudaConfigureCall(1, n)) ? (void)0 : (dot_product)(n, d_a, d_b);
where the host stub function dot_product is expanded into this:
void __device_stub__Z11dot_productiPfS_(int __par0, float *__par1, float *__par2)
{
if (cudaSetupArgument((void *)(char *)&__par0, sizeof(__par0), (size_t)0UL) != cudaSuccess) return;
if (cudaSetupArgument((void *)(char *)&__par1, sizeof(__par1), (size_t)8UL) != cudaSuccess) return;
if (cudaSetupArgument((void *)(char *)&__par2, sizeof(__par2), (size_t)16UL) != cudaSuccess) return;
{
volatile static char *__f __attribute__((unused)); __f = ((char *)((void ( *)(int, float *, float *))dot_product));
(void)cudaLaunch(((char *)((void ( *)(int, float *, float *))dot_product)));
};
}
void dot_product( int __cuda_0,float *__cuda_1,float *__cuda_2)
{
__device_stub__Z11dot_productiPfS_( __cuda_0,__cuda_1,__cuda_2);
}
cudaSetupArgument is the API call that pushes arguments onto the call stack. Interestingly, it is actually deprecated in the API documentation for CUDA 7.5, even though the compiler still uses it. I would therefore expect this to change in the future, but the idea will remain the same.
The parameters of the kernel call are copied prior to execution, so their scope should be of no concern. But please note that the total size of all kernel parameters is limited (to 4 KB at the time of writing). If you want larger structs or blobs of data, you need to allocate the memory on the device using cudaMalloc, copy the content of the host struct to the device struct using cudaMemcpy, and call the kernel with a pointer to the new device struct.
Your code would look something like this:
struct foo { int a; int b; float c; double d; };
foo arg;
foo *arg_d;
// fill in elements of `arg` here
cudaMalloc(&arg_d, sizeof(foo));
// check the allocation here
cudaMemcpy(arg_d, &arg, sizeof(foo), cudaMemcpyHostToDevice);
my_kernel<<<grid_size, block_size, 0, stream>>>(arg_d);
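For completeness, the kernel then receives a device pointer instead of the struct by value (a sketch only; the names are taken from the question):

__global__ void my_kernel(foo *arg)
{
    // read the members through the device copy, e.g. arg->a, arg->c, ...
}

Once the kernel and any later use of arg_d have finished, the allocation should be released with cudaFree(arg_d).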
I've got the following struct:
struct Param
{
double** K_RP;
};
And I want to perform the following operations on "K_RP" in CUDA:
__global__ void Test( struct Param prop)
{
int ix = threadIdx.x;
int iy = threadIdx.y;
prop.K_RP[ix][iy]=2.0;
}
If "prop" has the following form, how should I do my "cudaMalloc" and "cudaMemcpy" operations?
int main( )
{
Param prop;
Param cuda_prop;
prop.K_RP=alloc2D(Imax,Jmax);
//cudaMalloc cuda_prop ?
//cudaMemcpyH2D prop to cuda_prop ?
Test<<< (1,1), (Imax,Jmax)>>> ( cuda_prop);
//cudaMemcpyD2H cuda_prop to prop ?
return (0);
}
Questions like this get asked from time to time. If you search on the cuda tag, you'll find a variety of examples with answers. Here's one example.
In general, dynamically allocated data contained within structures or other objects requires special handling. This question/answer explains why and how to do it for the single pointer (*) case.
Handling double pointers (**) is difficult enough that most people would recommend "flattening" the storage so that it can be handled by reference with a single pointer (*). If you really want to see how the double pointer (**) method works, review this question/answer. It's not trivial.
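As an illustration of the flattening approach applied to the code in the question, here is a minimal sketch; the concrete sizes, the extra jmax kernel parameter, and the omitted error checking are assumptions rather than part of the original code:

#include <cstdlib>

const int Imax = 16;   // assumed sizes, for illustration only
const int Jmax = 16;

struct Param
{
    double *K_RP;   // flattened: element (i,j) lives at K_RP[i * Jmax + j]
};

__global__ void Test(Param prop, int jmax)
{
    int ix = threadIdx.x;
    int iy = threadIdx.y;
    prop.K_RP[ix * jmax + iy] = 2.0;
}

int main()
{
    Param prop, cuda_prop;
    prop.K_RP = (double *)malloc(Imax * Jmax * sizeof(double));   // host copy
    cudaMalloc(&cuda_prop.K_RP, Imax * Jmax * sizeof(double));    // device copy

    Test<<<1, dim3(Imax, Jmax)>>>(cuda_prop, Jmax);

    // copy the results back into the flattened host array
    cudaMemcpy(prop.K_RP, cuda_prop.K_RP, Imax * Jmax * sizeof(double),
               cudaMemcpyDeviceToHost);

    cudaFree(cuda_prop.K_RP);
    free(prop.K_RP);
    return 0;
}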
I have written my code for a single Xeon Phi node (with 61 cores). I have two files. I call MPI_Init before any other MPI calls, and I also obtain ntasks and the rank using MPI calls. I have included all the required libraries. Still, I get an error. Can you please help me out with this?
In file 1:
int buffsize;
int *sendbuff,**recvbuff,buffsum;
int *shareRegion;
shareRegion = (int*)gInit(MPI_COMM_WORLD, buffsize, ntasks); /* gInit is in file 2 */
buffsize=atoi(argv[1]);
sendbuff=(int *)malloc(sizeof(int)*buffsize);
if( taskid == 0 ){
recvbuff=(int **)malloc(sizeof(int *)*ntasks);
recvbuff[0]=(int *)malloc(sizeof(int)*ntasks*buffsize);
for(i=1;i<ntasks;i++)recvbuff[i]=recvbuff[i-1]+buffsize;
}
else{
recvbuff=(int **)malloc(sizeof(int *)*1);
recvbuff[0]=(int *)malloc(sizeof(int)*1);
}
for(i=0;i<buffsize;i++){
sendbuff[i]=1;
MPI_Barrier(MPI_COMM_WORLD);
call(sendbuff, buffsize, shareRegion, recvbuff[0],buffsize,taskid,ntasks);
In file 2:
void* gInit( MPI_Comm comm, int size, int num_proc)
{
int share_mem = shm_open("share_region", O_CREAT|O_RDWR,0666 );
if( share_mem == -1)
return NULL;
int rank;
MPI_Comm_rank(comm,&rank);
if( ftruncate( share_mem, sizeof(int)*size*num_proc) == -1 )
return NULL;
int* shared = mmap(NULL, sizeof(int)*size*num_proc, PROT_WRITE | PROT_READ, MAP_SHARED, share_mem, 0);
if(shared == (void*)-1)
printf("error in mem allocation (mmap)\n");
*(shared+(rank)) = 0;
MPI_Barrier(MPI_COMM_WORLD);
return shared;
}
void call(int *sendbuff, int sendcount, volatile int *sharedRegion, int **recvbuff, int recvcount, int rank, int size)
{
int i=0;
int k,j;
j=rank*sendcount;
for(i=0;i<sendcount;i++)
{
sharedRegion[j] = sendbuff[i];
j++;
}
if( rank == 0)
for(k=0;k<size;k++)
for(i=0;i<sendcount;i++)
{
j=0;
recvbuff[k][i] = sharedRegion[j];
j++;
}
}
Then I do some computation in file 1 on this recvbuff.
I get the segmentation fault while using the sharedRegion variable.
MPI implements the message-passing paradigm. That means processes (ranks) are isolated and generally run on a distributed machine. They communicate via explicit messages; recent versions also allow one-sided, but still explicit, data transfer. You cannot assume that shared memory is available to the processes. Have a look at any MPI tutorial to see how MPI is used.
Since you did not specify on what kind of machine you are running, any further suggestion is purely speculative. If you actually are on a shared memory machine, you may want to use a real shared memory paradigm instead, e.g. OpenMP.
While it's possible to restrict MPI to only use one machine and have shared memory (see the RMA chapter, especially in MPI-3), if you're only ever going to use one machine, it's easier to use some other paradigm.
However, if you're going to use multiple nodes and have multiple ranks on one node (multi-core processors, for example), then it might be worth taking a look at MPI-3 RMA to see how it can help you with both locally shared memory and remote memory access. There are multiple papers out on the subject, but because they're so new, there aren't a lot of good tutorials yet. You'll have to dig around a bit to find something useful to you.
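As a starting point, here is a minimal sketch (not the poster's code) of the MPI-3 shared-memory approach: MPI_Win_allocate_shared gives every rank on a node one segment of a contiguous shared window, replacing the shm_open/mmap logic, and MPI_Win_shared_query returns the base address of any rank's segment:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* communicator containing only the ranks that share this node's memory */
    MPI_Comm nodecomm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &nodecomm);

    int rank, size;
    MPI_Comm_rank(nodecomm, &rank);
    MPI_Comm_size(nodecomm, &size);

    /* each rank contributes one int to the shared window */
    int *mybase;
    MPI_Win win;
    MPI_Win_allocate_shared(sizeof(int), sizeof(int), MPI_INFO_NULL,
                            nodecomm, &mybase, &win);
    *mybase = rank;

    MPI_Win_fence(0, win);   /* make the writes of all ranks visible */

    if (rank == 0) {
        MPI_Aint sz;
        int disp;
        int *base0;
        /* base of rank 0's segment; with the default (contiguous) allocation
         * the segments of the other ranks follow directly after it */
        MPI_Win_shared_query(win, 0, &sz, &disp, &base0);
        for (int i = 0; i < size; i++)
            printf("slot %d holds %d\n", i, base0[i]);
    }

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}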
The ordering of these two lines:
shareRegion = (int*)gInit(MPI_COMM_WORLD, buffsize, ntasks); /* gInit is in file 2 */
buffsize=atoi(argv[1]);
suggests that buffsize could have different values before and after the call to gInit. If the buffsize passed as the first argument to the program is larger than its initial value at the time gInit is called, an out-of-bounds memory access will occur later and lead to a segmentation fault.
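In other words, reading the command-line argument first would make the shared mapping at least as large as the buffer used later; a sketch of the corrected order:

buffsize = atoi(argv[1]);                                     /* read the size first  */
shareRegion = (int*)gInit(MPI_COMM_WORLD, buffsize, ntasks);  /* then map the region  */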
Hint: run your code as an MPI singleton (e.g. without mpirun) from inside a debugger (e.g. gdb) or change the limits so that cores would get dumped on error (e.g. with ulimit -c unlimited) and then examine the core file(s) with the debugger. Compiling with debug information (e.g. adding -g to the compiler options) helps a lot in such cases.