OpenCL reduction from private to local then global? - opencl

The following kernel computes an acoustic pressure field, with each thread computing its own private instance of the pressure vector, which then needs to be summed down into global memory.
I'm pretty sure the code which computes the pressure vector is correct, but I'm still having trouble making this produce the expected result.
int gid = get_global_id(0);
int lid = get_local_id(0);
int nGroups = get_num_groups(0);
int groupSize = get_local_size(0);
int groupID = get_group_id(0);

/* Each workitem gets private storage for the pressure field.
 * The private instances are then summed into local storage at the end. */
private float2 pressure[HYD_DIM_TOTAL];
local float2 pressure_local[HYD_DIM_TOTAL];

/* Code which computes value of 'pressure' */

// wait for all workgroups to finish accessing any memory
barrier(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE);

/// sum all results in a workgroup into local buffer:
for(i=0; i<groupSize; i++){
    // each thread sums its own private instance into the local buffer
    if (i == lid){
        for(iHyd=0; iHyd<HYD_DIM_TOTAL; iHyd++){
            pressure_local[iHyd] += pressure[iHyd];
        }
    }
    // make sure all threads in workgroup get updated values of the local buffer
    barrier(CLK_LOCAL_MEM_FENCE);
}

/// copy all the results into global storage
// 1st thread in each workgroup writes the group's local buffer to global memory
if(lid == 0){
    for(iHyd=0; iHyd<HYD_DIM_TOTAL; iHyd++){
        pressure_global[groupID + nGroups*iHyd] = pressure_local[iHyd];
    }
}

barrier(CLK_GLOBAL_MEM_FENCE);

/// sum the various instances in global memory into a single one
// 1st thread sums global instances
if(gid == 0){
    for(iGroup=1; iGroup<nGroups; iGroup++){
        // we only need to sum the results from the 1st group onward
        for(iHyd=0; iHyd<HYD_DIM_TOTAL; iHyd++){
            pressure_global[iHyd] += pressure_global[iGroup*HYD_DIM_TOTAL + iHyd];
            barrier(CLK_GLOBAL_MEM_FENCE);
        }
    }
}
Some notes on data dimensions:
The total number of threads will vary between 100 and 2000, but may on occasion lie outside this interval.
groupSize will depend on the hardware, but I'm currently using values between 1 (CPU) and 32 (GPU).
HYD_DIM_TOTAL is known at compile time and varies between 4 and 32 (will generally, but not necessarily, be a power of 2).
Is there anything blatantly wrong with this reduction code?
PS: I run this on an i7 3930k with AMD APP SDK 2.8 and on an NVIDIA GTX580.

I notice two issues here, one big, one smaller:
This code suggests that you have a misunderstanding of what a barrier does. A barrier never synchronizes across multiple workgroups. It only synchronizes within a workgroup. The CLK_GLOBAL_MEM_FENCE makes it look like it is global synchronization, but it really isn't. That flag just fences all of the current work item's accesses to global memory. So outstanding writes will be globally observable after a barrier with this flag. But it does not change the barrier's synchronization behavior, which is only at the scope of a workgroup. There is no global synchronization in OpenCL, beyond launching another NDRange or Task.
The first for loop causes multiple work items to overwrite each other's computation. The indexing of pressure_local with iHyd will be done by each work item with the same iHyd. This will produce undefined results.
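One common way to restructure this (a sketch of my own, not the poster's code): keep the per-group accumulation inside the first kernel, write one partial result per work-group to global memory, and then launch a second, small NDRange to combine the partials, since nothing inside a single NDRange can synchronize across work-groups. Assuming the partials were stored as pressure_global[iGroup*HYD_DIM_TOTAL + iHyd], the second kernel could look roughly like this:

// Sketch only: second kernel launch that combines the per-group partial sums.
// Launch with HYD_DIM_TOTAL work-items; each one owns a single iHyd entry.
__kernel void sum_groups(__global float2 *pressure_global, const int nGroups)
{
    int iHyd = get_global_id(0);
    float2 sum = (float2)(0.0f, 0.0f);
    for (int iGroup = 0; iGroup < nGroups; iGroup++)
        sum += pressure_global[iGroup*HYD_DIM_TOTAL + iHyd];
    pressure_global[iHyd] = sum;   // final result for this component
}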
Hope this helps.

Related

Can I have a boolean buffer in OpenCL and change its value during kernel execution, for example to break a while loop?

I want to do some experiments in OpenCL, and I want to know whether it is possible to change state during kernel execution from the host code using a buffer.
I attempted to alter the state of a while loop in the kernel code by modifying the buffer value from the host code, but the execution hangs.
__kernel void my_kernel(
    __global bool *in,
    __global int *out)
{
    int i = get_global_id(0);
    while(1) {
        if(1 == *in) {
            printf("while loop is finished");
            break;
        }
    }
    printf("out[0] = %d\n", out[0]);
}
I then call clEnqueueWriteBuffer() a second time to change the state of the input value.
input[0] = 1;
err = clEnqueueWriteBuffer(commands, input_buffer,
CL_TRUE, 0, sizeof(int), (void*)input,
0, NULL,NULL);
At least for OpenCL 1.x, this is not permitted, and any behaviour you may observe in one implementation cannot be relied upon.
See the NOTE in the OpenCL 1.2 specification, section 5.2.2, Reading, Writing and Copying Buffer Objects:
Calling clEnqueueWriteBuffer to update the latest bits in a region of the buffer object with the ptr argument value set to host_ptr + offset, where host_ptr is a pointer to the memory region specified when the buffer object being written is created with CL_MEM_USE_HOST_PTR, must meet the following requirements in order to avoid undefined behavior:
The host memory region given by (host_ptr + offset, cb) contains the latest bits when the enqueued write command begins execution.
The buffer object or memory objects created from this buffer object are not mapped.
The buffer object or memory objects created from this buffer object are not used by any command-queue until the write command has finished execution.
The final condition is not met by your code, therefore its behaviour is undefined.
I am not certain if the situation is different with OpenCL 2.x's Shared Virtual Memory (SVM) feature, as I have no practical experience using it, perhaps someone else can contribute an answer for that.
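In the meantime, a conforming pattern on OpenCL 1.x is to avoid spinning inside the kernel altogether: do the work in bounded kernel launches and check a flag between launches. A rough host-side sketch of my own (done_buffer is a hypothetical cl_mem the kernel writes, not something from the question):

/* Hypothetical host-side sketch: bounded launches with a flag read in between. */
cl_int done = 0;
while (!done) {
    /* Each launch performs a bounded chunk of work and then returns. */
    clEnqueueNDRangeKernel(commands, kernel, 1, NULL,
                           &global_size, NULL, 0, NULL, NULL);

    /* Blocking read of a flag the kernel wrote; decide on the host whether to continue. */
    clEnqueueReadBuffer(commands, done_buffer, CL_TRUE, 0,
                        sizeof(cl_int), &done, 0, NULL, NULL);
}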

Understanding the method for OpenCL reduction on float

Following this link, I am trying to understand how the kernel code works (there are two versions of this kernel code, one with volatile local float *source and the other with volatile global float *source, i.e. local and global versions). Below I take the local version:
float sum = 0;
void atomic_add_local(volatile local float *source, const float operand) {
    union {
        unsigned int intVal;
        float floatVal;
    } newVal;
    union {
        unsigned int intVal;
        float floatVal;
    } prevVal;
    do {
        prevVal.floatVal = *source;
        newVal.floatVal = prevVal.floatVal + operand;
    } while (atomic_cmpxchg((volatile local unsigned int *)source, prevVal.intVal, newVal.intVal) != prevVal.intVal);
}
If I understand correctly, each work-item shares access to the source variable thanks to the "volatile" qualifier, doesn't it?
Afterwards, if I take a work-item, the code adds the operand value to the newVal.floatVal variable. Then, after this operation, atomic_cmpxchg is called, which checks whether the previous assignments (prevVal.floatVal = *source; and newVal.floatVal = prevVal.floatVal + operand;) have been done, i.e. by comparing the value stored at address source with prevVal.intVal.
During this atomic operation (which is uninterruptible by definition), since the value stored at source is different from prevVal.intVal, the new value stored at source is newVal.intVal, which is actually a float (because it is coded on 4 bytes, like an integer).
Can we say that each work-item has mutex access (I mean locked access) to the value located at the source address?
But for each work-item, is there only one iteration of the while loop?
I think there will be one iteration, because the comparison "*source == prevVal.intVal ? newVal.intVal : newVal.intVal" will always assign the newVal.intVal value to the value stored at the source address, won't it?
I have not understood all the subtleties of this trick for this kernel code.
Update
Sorry, I now understand almost all of the subtleties, especially in the while loop:
First case: for a given single thread, before the call to atomic_cmpxchg, if prevVal.floatVal is still equal to *source, then atomic_cmpxchg changes the value contained at the source pointer and returns the old value, which is equal to prevVal.intVal, so we break out of the while loop.
Second case: if, between the prevVal.floatVal = *source; instruction and the call to atomic_cmpxchg, the value *source has changed (by another thread?), then atomic_cmpxchg returns the old value, which is no longer equal to prevVal.intVal, so the condition of the while loop is true and we stay in the loop until that condition no longer holds.
Is my interpretation correct?
If I understand correctly, each work-item shares access to the source variable thanks to the "volatile" qualifier, doesn't it?
volatile is a keyword of the C language that prevents the compiler from optimizing accesses to a specific location in memory (in other words, force a load/store at each read/write of said memory location). It has no impact on the ownership of the underlying storage. Here, it is used to force the compiler to re-read source from memory at each loop iteration (otherwise the compiler would be allowed to move that load outside the loop, which breaks the algorithm).
do {
    prevVal.floatVal = *source; // Force read, prevent hoisting outside loop.
    newVal.floatVal = prevVal.floatVal + operand;
} while (atomic_cmpxchg((volatile local unsigned int *)source, prevVal.intVal, newVal.intVal) != prevVal.intVal);
After removing qualifiers (for simplicity) and renaming parameters, the signature of atomic_cmpxchg is the following:
int atomic_cmpxchg(int *ptr, int expected, int new)
What it does is:
atomically {
    int old = *ptr;
    if (old == expected) {
        *ptr = new;
    }
    return old;
}
To summarize, each thread, individually, does:
Load the current value of *source from memory into prevVal.floatVal
Compute desired value of *source in newVal.floatVal
Execute the atomic compare-exchange described above (using the type-punned values)
If the result of atomic_cmpxchg == prevVal.intVal, it means the compare-exchange was successful, break. Otherwise, the exchange didn't happen; go to 1 and try again.
The above loop eventually terminates, because each thread eventually succeeds in its atomic_cmpxchg.
Can we say that each work-item has mutex access (I mean locked access) to the value located at the source address?
Mutexes are locks, while this is a lock-free algorithm. OpenCL can simulate mutexes with spinlocks (also implemented with atomics) but this is not one.
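To see how this helper is typically used, here is a minimal usage sketch of my own (the kernel and buffer names are assumptions, not from the linked code): each work-item atomically accumulates into one local float, then the first work-item writes the group's partial sum to global memory.

// Usage sketch: per-work-group float reduction built on atomic_add_local.
__kernel void reduce_sum(__global const float *in, __global float *partialSums)
{
    local float groupSum;                        // one accumulator per work-group
    if (get_local_id(0) == 0)
        groupSum = 0.0f;
    barrier(CLK_LOCAL_MEM_FENCE);

    atomic_add_local(&groupSum, in[get_global_id(0)]);
    barrier(CLK_LOCAL_MEM_FENCE);

    if (get_local_id(0) == 0)
        partialSums[get_group_id(0)] = groupSum; // one partial result per group
}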

segmentation fault when using shared memory created by open_shm on Xeon Phi

I have written my code for a single Xeon Phi node (with 61 cores on it). I have two files. I call MPI_Init before any other MPI calls, and I obtain ntasks and the rank using MPI calls as well. I have also included all the required libraries. Still, I get an error. Can you please help me out with this?
In file 1:
int buffsize;
int *sendbuff, **recvbuff, buffsum;
int *shareRegion;

shareRegion = (int*)gInit(MPI_COMM_WORLD, buffsize, ntasks); /* gInit is in file 2 */
buffsize = atoi(argv[1]);
sendbuff = (int *)malloc(sizeof(int)*buffsize);

if( taskid == 0 ){
    recvbuff = (int **)malloc(sizeof(int *)*ntasks);
    recvbuff[0] = (int *)malloc(sizeof(int)*ntasks*buffsize);
    for(i=1; i<ntasks; i++) recvbuff[i] = recvbuff[i-1] + buffsize;
}
else{
    recvbuff = (int **)malloc(sizeof(int *)*1);
    recvbuff[0] = (int *)malloc(sizeof(int)*1);
}

for(i=0; i<buffsize; i++){
    sendbuff[i] = 1;
}

MPI_Barrier(MPI_COMM_WORLD);

call(sendbuff, buffsize, shareRegion, recvbuff[0], buffsize, taskid, ntasks);
In file 2:
void* gInit( MPI_Comm comm, int size, int num_proc)
{
    int share_mem = shm_open("share_region", O_CREAT|O_RDWR, 0666);
    if( share_mem == -1)
        return NULL;

    int rank;
    MPI_Comm_rank(comm, &rank);

    if( ftruncate( share_mem, sizeof(int)*size*num_proc) == -1 )
        return NULL;

    int* shared = mmap(NULL, sizeof(int)*size*num_proc, PROT_WRITE | PROT_READ, MAP_SHARED, share_mem, 0);
    if(shared == (void*)-1)
        printf("error in mem allocation (mmap)\n");

    *(shared + rank) = 0;

    MPI_Barrier(MPI_COMM_WORLD);
    return shared;
}
void call(int *sendbuff, int sendcount, volatile int *sharedRegion, int **recvbuff, int recvcount, int rank, int size)
{
    int i = 0;
    int k, j;

    j = rank*sendcount;
    for(i=0; i<sendcount; i++)
    {
        sharedRegion[j] = sendbuff[i];
        j++;
    }

    if( rank == 0)
        for(k=0; k<size; k++)
            for(i=0; i<sendcount; i++)
            {
                j = 0;
                recvbuff[k][i] = sharedRegion[j];
                j++;
            }
}
Then I do some computation in file 1 on this recvbuff.
I get a segmentation fault when using the sharedRegion variable.
MPI represents the message-passing paradigm. That means processes (ranks) are isolated and generally run on a distributed machine. They communicate via explicit messages; recent versions also allow one-sided, but still explicit, data transfer. You cannot assume that shared memory is available to the processes. Have a look at any MPI tutorial to see how MPI is used.
Since you did not specify on what kind of machine you are running, any further suggestion is purely speculative. If you actually are on a shared memory machine, you may want to use a real shared memory paradigm instead, e.g. OpenMP.
While it's possible to restrict MPI to only use one machine and have shared memory (see the RMA chapter, especially in MPI-3), if you're only ever going to use one machine, it's easier to use some other paradigm.
However, if you're going to use multiple nodes and have multiple ranks on one node (multi-core processors, for example), then it might be worth taking a look at MPI-3 RMA to see how it can help you with both locally shared memory and remote memory access. There are multiple papers out on the subject, but because they're so new, there isn't a lot of good tutorial material yet. You'll have to dig around a bit to find something useful to you.
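For illustration only (the function and variable names here are my own, not from the question), a node-local shared allocation with MPI-3 RMA might look roughly like this, instead of shm_open/mmap:

/* Hypothetical sketch: MPI-3 node-local shared memory via an RMA window. */
#include <mpi.h>

int *alloc_node_shared(int size, int num_proc, MPI_Win *win)
{
    MPI_Comm node_comm;
    /* Group the ranks that can actually share memory (same node). */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    int rank;
    MPI_Comm_rank(node_comm, &rank);

    /* Rank 0 allocates the whole region; the others contribute 0 bytes. */
    MPI_Aint bytes = (rank == 0) ? (MPI_Aint)size * num_proc * sizeof(int) : 0;
    int *base;
    MPI_Win_allocate_shared(bytes, sizeof(int), MPI_INFO_NULL,
                            node_comm, &base, win);

    /* Every rank queries rank 0's segment to get a pointer to its start. */
    MPI_Aint qsize;
    int disp_unit;
    MPI_Win_shared_query(*win, 0, &qsize, &disp_unit, &base);
    return base;
}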
The ordering of these two lines:
shareRegion = (int*)gInit(MPI_COMM_WORLD, buffsize, ntasks); /* gInit is in file 2 */
buffsize=atoi(argv[1]);
suggests that buffsize could have different values before and after the call to gInit. If the buffsize passed as the program's first argument is larger than its initial value at the time gInit is called, then out-of-bounds memory access will occur later and lead to a segmentation fault.
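In other words, reading the size before creating the shared region avoids the mismatch:

buffsize = atoi(argv[1]);                                    /* read the size first */
shareRegion = (int*)gInit(MPI_COMM_WORLD, buffsize, ntasks); /* then create the region */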
Hint: run your code as an MPI singleton (e.g. without mpirun) from inside a debugger (e.g. gdb), or change the limits so that core files get dumped on error (e.g. with ulimit -c unlimited) and then examine the core file(s) with the debugger. Compiling with debug information (e.g. adding -g to the compiler options) helps a lot in such cases.

Effect of using page-able memory for asynchronous memory copy?

In the CUDA C Best Practices Guide Version 5.0, Section 6.1.2, it is written that:
In contrast with cudaMemcpy(), the asynchronous transfer version
requires pinned host memory (see Pinned Memory), and it contains an
additional argument, a stream ID.
This implies that the cudaMemcpyAsync function should fail if I use ordinary pageable memory.
But this is not what happened.
Just for testing purposes, I tried the following program:
Kernel:
__global__ void kernel_increment(float* src, float* dst, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if(tid < n)
        dst[tid] = src[tid] + 1.0f;
}
Main:
int main()
{
    float *hPtr1, *hPtr2, *dPtr1, *dPtr2;
    const int n = 1000;
    size_t bytes = n * sizeof(float);
    cudaStream_t str1, str2;

    hPtr1 = new float[n];
    hPtr2 = new float[n];

    for(int i=0; i<n; i++)
        hPtr1[i] = static_cast<float>(i);

    cudaMalloc<float>(&dPtr1, bytes);
    cudaMalloc<float>(&dPtr2, bytes);

    dim3 block(16);
    dim3 grid((n + block.x - 1)/block.x);

    cudaStreamCreate(&str1);
    cudaStreamCreate(&str2);

    cudaMemcpyAsync(dPtr1, hPtr1, bytes, cudaMemcpyHostToDevice, str1);
    kernel_increment<<<grid,block,0,str2>>>(dPtr1, dPtr2, n);
    cudaMemcpyAsync(hPtr2, dPtr2, bytes, cudaMemcpyDeviceToHost, str1);

    printf("Status: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaDeviceSynchronize();
    printf("Status: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaStreamDestroy(str1);
    cudaStreamDestroy(str2);

    cudaFree(dPtr1);
    cudaFree(dPtr2);

    for(int i=0; i<n; i++)
        std::cout << hPtr2[i] << std::endl;

    delete[] hPtr1;
    delete[] hPtr2;

    return 0;
}
The program gave correct output. The array incremented successfully.
How did cudaMemcpyAsync execute without page locked memory?
Am I missing something here?
cudaMemcpyAsync is fundamentally an asynchronous version of cudaMemcpy. This means that it doesn't block the calling host thread when the copy call is issued. That is the basic behaviour of the call.
Optionally, if the call is launched into the non default stream, and if the host memory is a pinned allocation, and the device has a free DMA copy engine, the copy operation can happen while the GPU simultaneously performs another operation: either kernel execution or another copy (in the case of a GPU with two DMA copy engines). If any of these conditions are not satisfied, the operation on the GPU is functionally identical to a standard cudaMemcpy call, ie. it serialises operations on the GPU, and no simultaneous copy-kernel execution or simultaneous multiple copies can occur. The only difference is that the operation doesn't block the calling host thread.
In your example code, the host source and destination memory are not pinned. So the memory transfer cannot overlap with kernel execution (ie. they serialise operations on the GPU). The calls are still asynchronous on the host. So what you have is functionally equivalent to:
cudaMemcpy(dPtr1,hPtr1,bytes,cudaMemcpyHostToDevice);
kernel_increment<<<grid,block>>>(dPtr1,dPtr2,n);
cudaMemcpy(hPtr2,dPtr2,bytes,cudaMemcpyDeviceToHost);
with the exception that all the calls are asynchronous on the host, so the host thread blocks at the cudaDeviceSynchronize() call rather than at each of the memory transfer calls.
This is absolutely expected behaviour.
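For the transfers to actually overlap with kernel execution, the host buffers would have to be pinned. A minimal sketch of that change (my own, replacing the new[] allocations in the question, not code from the original answer):

// Sketch: pinned (page-locked) host allocations so cudaMemcpyAsync can
// overlap with kernel execution on devices with a free copy engine.
float *hPtr1, *hPtr2;
cudaMallocHost((void**)&hPtr1, bytes);
cudaMallocHost((void**)&hPtr2, bytes);

// ... same cudaMemcpyAsync / kernel_increment / cudaMemcpyAsync sequence as above ...

cudaFreeHost(hPtr1);
cudaFreeHost(hPtr2);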

Asynchronous execution of commands from two command queues in OpenCL

I am trying to build an application that can utilize both the CPU and the GPU at the same time with OpenCL. Specifically, I have two kernels, one executing on the CPU and one on the GPU. The CPU kernel will change the content of one buffer, and the GPU will do other things when it detects that the buffer has been changed by the CPU.
__kernel void cpuKernel(__global uint * dst1, const uint size)
{
    uint tid = get_global_id(0);
    uint gsize = get_global_size(0);
    while(tid < size)
    {
        atomic_xchg(&dst1[tid], 10);
        tid += gsize;
    }
}

__kernel void gpuKernel(__global uint * dst1, __global uint * dst2, const uint size)
{
    uint tid = get_global_id(0);
    uint gsize = get_global_size(0);
    while(tid < size)
    {
        while(dst1[tid] != 10)
            ;
        dst2[tid] = dst1[tid];
        tid += gsize;
    }
}
As shown above, cpuKernel changes each element of the dst1 buffer to 10; correspondingly, once the GPU detects such a change, it assigns the element value (10) to the same position in another buffer, dst2. cpuKernel is enqueued on command1, which is associated with the CPU device, and gpuKernel is enqueued on command2, which is associated with the GPU device; both command queues have the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE flag set.
Then I make two cases:
case 1:
clEnqueueNDRangeKernel(command2, gpuKernel);
clEnqueueNDRangeKernel(command1, cpuKernel);
clFinish(command1);
clFinish(command2);
case 2:
clEnqueueNDRangeKernel(command1, cpuKernel);
clFinish(command1);
clEnqueueNDRangeKernel(command2, gpuKernel);
clFinish(command2);
But the results show that the time consumed in the two cases is nearly the same; I expected some overlap in case 1, but there is none. Can anyone help me? Thanks!
Or, can anyone explain how to implement two kernels running asynchronously on two devices in OpenCL?
You are asking too much. As you have probably noticed, buffer objects are relative to a context, while command queues are related to devices.
If a kernel operates on a buffer object, the corresponding data must be on that device. If you do not transfer it explicitly with clEnqueueWriteBuffer(), OpenCL will do that for you.
Hence, if you modify a buffer object with a kernel on one device (for example the CPU), and just after on another device (for example the GPU), the OpenCL driver will wait for the first kernel to finish, transfer the data, and then run the second kernel.
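For what it's worth, if each kernel worked on its own buffer, the two queues could genuinely overlap. A rough host-side sketch of my own (buffer and size names are assumed, and argument setup is abbreviated):

/* Sketch: independent buffers per device so the queues are not serialised. */
clSetKernelArg(cpuKernel, 0, sizeof(cl_mem), &bufCpu);   /* buffer used only by the CPU kernel */
clSetKernelArg(gpuKernel, 0, sizeof(cl_mem), &bufGpu);   /* buffer used only by the GPU kernel */

clEnqueueNDRangeKernel(command1, cpuKernel, 1, NULL, &globalSize, NULL, 0, NULL, NULL);
clEnqueueNDRangeKernel(command2, gpuKernel, 1, NULL, &globalSize, NULL, 0, NULL, NULL);
clFlush(command1);   /* submit both queues before blocking on either */
clFlush(command2);
clFinish(command1);
clFinish(command2);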
