Do 64-bit atomic operations work in OpenCL on AMD cards?

The implementation of emulated atomics in OpenCL following the STREAM blog works nicely for atomic add in 32-bit, on CPUs as well as on NVIDIA and AMD GPUs.
The 64-bit equivalent, based on the cl_khr_int64_base_atomics extension, seems to run properly on CPU (pocl and Intel) as well as on NVIDIA OpenCL drivers.
I fail to make 64-bit work on AMD GPU cards though -- both in amdgpu-pro and ROCm (3.5.0) environments, running on a Radeon VII and a Radeon Instinct MI50, respectively.
The implementation goes as follows:
inline void atomicAdd(volatile __global double *addr, double val)
{
    union {
        long   u64;
        double f64;
    } next, expected, current;
    current.f64 = *addr;
    do {
        expected.f64 = current.f64;
        next.f64     = expected.f64 + val;
        current.u64  = atomic_cmpxchg(
            (volatile __global long *)addr,
            (long) expected.u64,
            (long) next.u64);
    } while (current.u64 != expected.u64);
}
In the absence of support for atomic operations on double types, the idea is to exploit casting to long, as the values only need to be stored (no arithmetic is performed on the long representation). One should then be able to use long atom_cmpxchg(__global long *p, long cmp, long val) as defined in the Khronos manual for int64 base atomics.
The error I receive in both AMD environments points to a fallback to the 32-bit versions; the compiler does not seem to recognise the 64-bit versions despite the #pragma:
/tmp/comgr-0bdbdc/input/CompileSource:21:17: error: call to 'atomic_cmpxchg' is ambiguous
current.u64 = atomic_cmpxchg(
^~~~~~~~~~~~~~
[...]/centos_pipeline_job_3.5/rocm-rel-3.5/rocm-3.5-30-20200528/7.5/out/centos-7/7/build/amd_comgr/<stdin>:13468:12: note: candidate function
int __ovld atomic_cmpxchg(volatile __global int *p, int cmp, int val);
^
[...]/centos_pipeline_job_3.5/rocm-rel-3.5/rocm-3.5-30-20200528/7.5/out/centos-7/7/build/amd_comgr/<stdin>:13469:21: note: candidate function
unsigned int __ovld atomic_cmpxchg(volatile __global unsigned int *p, unsigned int cmp, unsigned int val);
^
1 error generated.
Error: Failed to compile opencl source (from CL or HIP source to LLVM IR).
I do find support for cl_khr_int64_base_atomics in both environments in the clinfo extension list though. Also, cl_khr_int64_base is present in the OpenCL driver binary file.
Does anybody have an idea what might be going wrong here? Using the same implementation for 32-bit (int and float instead of long and double) works flawlessly for me...
Thanks for any hints.

For 64-bit, the function is called atom_cmpxchg and not atomic_cmpxchg.
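A minimal sketch of the helper from the question rewritten with the 64-bit function (same logic as above; this assumes the pragma is enabled and that the long overload of atom_cmpxchg from cl_khr_int64_base_atomics is available, as advertised by clinfo):

#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable

inline void atomicAdd(volatile __global double *addr, double val)
{
    union {
        long   u64;
        double f64;
    } next, expected, current;
    current.f64 = *addr;
    do {
        expected.f64 = current.f64;
        next.f64     = expected.f64 + val;
        // atom_cmpxchg (not atomic_cmpxchg) is the 64-bit compare-and-swap
        // defined by cl_khr_int64_base_atomics.
        current.u64  = atom_cmpxchg(
            (volatile __global long *)addr,
            (long) expected.u64,
            (long) next.u64);
    } while (current.u64 != expected.u64);
}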

Related

MPI 3 shared memory and cache conflicts

When using MPI 3 shared memory, I noticed that writing simultaneously to adjacent memory positions of a shared memory window on different tasks seemingly does not work.
I guessed that MPI ignores possible cache conflicts, and my question is whether that is correct and MPI indeed does not care about cache coherency, or whether this is a quirk of the implementation, or whether there is a completely different explanation for that behaviour.
This is a minimal example where, in Fortran, simultaneously writing to distinct addresses in a shared memory window causes a conflict (tested with Intel MPI 2017, 2018, 2019 and GNU OpenMPI 3).
program testAlloc
  use mpi
  use, intrinsic :: ISO_C_BINDING, only: c_ptr, c_f_pointer
  implicit none
  integer :: ierr
  integer :: window
  integer(kind=MPI_Address_kind) :: wsize
  type(c_ptr) :: baseptr
  integer, pointer :: f_ptr
  integer :: comm_rank
  call MPI_Init(ierr)
  ! Each processor allocates one entry
  wsize = 1
  call MPI_WIN_ALLOCATE_SHARED(wsize, 4, MPI_INFO_NULL, MPI_COMM_WORLD, baseptr, window, ierr)
  ! Convert to a Fortran pointer
  call c_f_pointer(baseptr, f_ptr)
  ! Now, assign some value simultaneously
  f_ptr = 4
  ! For output, get the MPI rank
  call mpi_comm_rank(MPI_COMM_WORLD, comm_rank, ierr)
  ! Output the assigned value - only one task reports 4, the others report junk
  print *, "On task", comm_rank, "value is", f_ptr
  call MPI_Win_free(window, ierr)
  call MPI_Finalize(ierr)
end program
Curiously, the same program in C does seem to work as intended, which raises the question whether there is something wrong with the Fortran implementation, or whether the C program is just lucky (tested with the same MPI libraries).
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
  MPI_Init(&argc, &argv);
  // Allocate a single resource per task
  MPI_Aint wsize = 1;
  // Do a shared allocation
  int *resource;
  MPI_Win window;
  MPI_Win_allocate_shared(wsize, sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &resource, &window);
  // For output clarification, get the MPI rank
  int comm_rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &comm_rank);
  // Assign some value
  *resource = 4;
  // Tell us the value - this seems to work
  printf("On task %d the value is %d\n", comm_rank, *resource);
  MPI_Win_free(&window);
  MPI_Finalize();
}
From the MPI 3.1 standard (chapter 11.2.3 page 407)
MPI_WIN_ALLOCATE_SHARED(size, disp_unit, info, comm, baseptr, win)
IN size size of local window in bytes (non-negative integer)
Note the window size is in bytes and not in number of units.
So all you need is to use
wsize = 4
in Fortran (assuming your INTEGER size is indeed 4) and
wsize = sizeof(int);
in C
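For illustration, a minimal sketch of the corrected allocation in the C version (only wsize changes; the rest is as in the question):

// The window size is given in bytes, so request room for one int per task.
MPI_Aint wsize = sizeof(int);
int *resource;
MPI_Win window;
MPI_Win_allocate_shared(wsize, sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD,
                        &resource, &window);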
FWIW
even if the C version seems to give the expected result most of the time, it is also incorrect, and I am able to evidence this by running the program under a debugger.
generally speaking, you might have to declare volatile int * resource; in C to prevent the compiler from performing some optimizations that might impact the behavior of your app (though this is not needed here).

How to pass char pointer into opencl kernel?

I am trying to pass a char pointer into the kernel function of OpenCL as
char *rp=(char*)malloc(something);
ciErr=clSetKernelArg(ckKernel,0,sizeof(cl_char* ),(char *)&rp)
and my kernel is as
__kernel void subFilter(char *rp)
{
do something
}
When I am running the kernel I am getting
error -48 in clsetkernelargs 1
Also, I tried to modify the kernel as
__kernel void subFilter(__global char *rp)
{
do something
}
I got error as
error -38 in clsetkernelargs 1
which says invalid mem object.
I just want to access the memory location pointed to by rp in the kernel.
Any help would be greatly appreciated.
Thanks,
Piyush
Any arrays and memory objects that you use in an OpenCL kernel need to be allocated via the OpenCL API (e.g. using clCreateBuffer). This is because the host and device don't always share the same physical memory. A pointer to data that is allocated on the host (via malloc) means absolutely nothing to a discrete GPU, for example.
To pass an array of characters to an OpenCL kernel, you should write something along the lines of:
char *h_rp = (char*)malloc(length);
cl_mem d_rp = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, length, h_rp, &err);
err = clSetKernelArg(ckKernel, 0, sizeof(cl_mem), &d_rp)
and declare the argument with the __global (or __constant) qualifier in your kernel. You can then copy the data back to the host with clEnqueueReadBuffer.
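To illustrate the read-back step, a minimal sketch (assuming a command queue named queue and the length used above):

// Blocking read: copy length bytes from the device buffer back into h_rp.
err = clEnqueueReadBuffer(queue, d_rp, CL_TRUE, 0, length, h_rp, 0, NULL, NULL);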
If you do know that host and device share the same physical memory, then you can allocate memory that is visible to both host and device by creating a buffer with the CL_MEM_ALLOC_HOST_PTR flag, and using clEnqueueMapBuffer when you wish to access the data from the host. The new shared-virtual-memory (SVM) features of OpenCL 2.0 also improve the way that you can share buffers between host and device on unified-memory architectures.

Exit early on found in OpenCL

I'm trying to write an OpenCL implementation of memchr to help me learn how OpenCL works. What I'm planning to do is to assign each work item a chunk of memory to search. Then, inside each work item, it loops through the chunk searching for the character.
Especially if the buffer is large, I don't want the other threads to keep searching after an occurrence has already been found (assume there is only one occurrence of the character in any given buffer).
What I'm stuck on is: how does a work item indicate, both to the host and to other threads, that it has found the character?
Thanks,
One way you could do this is to use a global flag variable. You atomically set it to 1 when you find the value and other threads will check on that value when they are doing work.
For example:
__kernel void test(__global int* buffer, __global volatile int* flag)
{
    int tid = get_global_id(0);
    int sx  = get_global_size(0);
    int i   = tid;
    while (buffer[i] != 8) // Whatever value we're trying to find.
    {
        int stop = atomic_add(flag, 0); // Read the atomic value.
        if (stop)
            break;
        i = i + sx;
    }
    atomic_xchg(flag, 1); // Set the atomic value.
}
This might add more overhead than by just running the whole kernel (unless you are doing a lot of work on every iteration). In addition, this method won't work if each thread is just checking a single value in the array. Each thread must have multiple iterations of work.
Finally, I've seen instances where writing to an atomic variable doesn't immediately commit, so you need to check to see if this code will deadlock on your system because the write isn't committing.
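On the host side, a minimal sketch of preparing and checking the flag could look like this (context, queue, and err are assumed to exist; the flag buffer is initialised to 0 before the kernel is enqueued):

int zero = 0;
cl_mem d_flag = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                               sizeof(int), &zero, &err);
// ... set kernel arguments and enqueue the kernel here ...
int found = 0;
err = clEnqueueReadBuffer(queue, d_flag, CL_TRUE, 0, sizeof(int), &found, 0, NULL, NULL);
// found == 1 means some work item reached the search value.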

OpenCL kernel fails to compile asking for address space qualifier

The following OpenCL code fails to compile.
typedef struct {
    double d;
    double* da;
    long* la;
    uint ui;
} MyStruct;

__kernel void MyKernel (__global MyStruct* s) {
}
The error message is as follows.
line 11: error: kernel pointer arguments must point to addrSpace global, local, or constant
__kernel void MyKernel (__global MyStruct* s) {
^
As you can see I have clearly qualified the argument with '__global' as the error suggests I should. What am I doing wrong and how can I resolve this error?
Obviously this happens during kernel compilation so I haven't posted my host code here as it doesn't even get further than this.
Thanks.
I think the problem is that you have pointers in your struct, which is not allowed. You cannot point to host memory from your kernel like that, so pointers inside kernel argument structs don't make much sense. Variable-sized arrays are backed in OpenCL by a cl_mem object on the host, and each one counts as a whole kernel argument, so as far as I know you can only pass variable-sized arrays directly as kernel arguments (and adjust the number of work units accordingly, of course).
You might prefer to put size information in your struct and pull out the arrays as standalone kernel arguments.
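A sketch of that restructuring (field and argument names are made up for illustration):

typedef struct {
    double d;
    uint   ui;
    uint   da_len;   // array sizes travel inside the struct
    uint   la_len;
} MyStruct;

__kernel void MyKernel (__global MyStruct* s,
                        __global double*   da,   // arrays passed as separate buffers
                        __global long*     la)
{
}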

How can I perform 64-bit division with a 32-bit divide instruction?

This is (AFAIK) a specific question within this general topic.
Here's the situation:
I have an embedded system (a video game console) based on a 32-bit RISC microcontroller (a variant of NEC's V810). I want to write a fixed-point math library. I read this article, but the accompanying source code is written in 386 assembly, so it's neither directly usable nor easily modifiable.
The V810 has built-in integer multiply/divide, but I want to use the 18.14 format mentioned in the above article. This requires dividing a 64-bit int by a 32-bit int, and the V810 only does (signed or unsigned) 32-bit/32-bit division (which produces a 32-bit quotient and a 32-bit remainder).
So, my question is: how do I simulate a 64-bit/32-bit divide with a 32-bit/32-bit one (to allow for the pre-shifting of the dividend)? Or, to look at the problem another way, what's the best way to divide an 18.14 fixed-point number by another using standard 32-bit arithmetic/logic operations? ("best" meaning fastest, smallest, or both).
Algebra, (V810) assembly, and pseudo-code are all fine. I will be calling the code from C.
Thanks in advance!
EDIT: Somehow I missed this question... However, it will still need some modification to be super-efficient (it has to be faster than the floating-point div provided by the v810, though it may already be...), so feel free to do my work for me in exchange for reputation points ;) (and credit in my library documentation, of course).
GCC has such a routine for many processors, named __divdi3 (usually implemented via a common divmod call). Here's one. Some Unix kernels have an implementation too, e.g. FreeBSD.
If your dividend is an unsigned 64-bit value, your divisor is an unsigned 32-bit value, and the architecture is i386 (x86), the div assembly instruction can help you, with some preparation:
#include <stdint.h>

/* Returns *a % b, and sets *a = *a_old / b; */
uint32_t UInt64DivAndGetMod(uint64_t *a, uint32_t b) {
#ifdef __i386__  /* u64 / u32 division with little i386 machine code. */
  uint32_t upper = ((uint32_t*)a)[1], r;
  ((uint32_t*)a)[1] = 0;
  if (upper >= b) {
    ((uint32_t*)a)[1] = upper / b;
    upper %= b;
  }
  __asm__("divl %2" : "=a" (((uint32_t*)a)[0]), "=d" (r) :
          "rm" (b), "0" (((uint32_t*)a)[0]), "1" (upper));
  return r;
#else
  const uint64_t q = *a / b;      /* Calls __udivdi3 in libgcc. */
  const uint32_t r = *a - b * q;  /* `r = *a % b' would use __umoddi3. */
  *a = q;
  return r;
#endif
}
If the line above with __udivdi3 doesn't compile for you, use the __div64_32 function from the Linux kernel: https://github.com/torvalds/linux/blob/master/lib/div64.c
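For the 18.14 use case from the question, a hedged sketch of how a helper like the one above could be wrapped for a signed fixed-point divide (fixed_div_18_14 is a made-up name; no overflow or divide-by-zero handling):

#include <stdlib.h>  /* for llabs */

/* Divide two signed 18.14 fixed-point numbers via the unsigned 64/32 helper. */
int32_t fixed_div_18_14(int32_t a, int32_t b) {
    int negative = (a < 0) != (b < 0);               /* sign of the result */
    uint64_t n = (uint64_t)llabs((int64_t)a) << 14;  /* pre-shift the dividend by 14 fraction bits */
    UInt64DivAndGetMod(&n, (uint32_t)(b < 0 ? -b : b));
    return negative ? -(int32_t)n : (int32_t)n;
}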
