Where should I define a C function that will be called in C kernel code when using PyOpenCL?

Since kernel code in PyOpenCL has to be written in C, I have written a few functions that need to be called inside the kernel code. Where should I store these functions? And how do I pass a global variable to such a function?
In PyOpenCL my kernel code looks like this:
program = cl.Program(context, """
__kernel void Kernel_OVERLAP_BETWEEN_N_IP_GPU(__constant int *FBNs_array,__local int *Binary_IP, __local int *cc,__global const int *olp)
{
function1(int *x, int *y,__global const int *olp);
}
""").build()
Where should I write and store function1? Should I define it in the kernel source itself, or in some other file and provide a path? If I need to define it somewhere else and provide a path, please give me some details; I am completely new to C.
Thanks

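As in C, define the function before the kernel, in the same program source:
program = cl.Program(context, """
void function1(int *x, int *y)
{
    // function1 code
}
__kernel void kernel_name()
{
    int x = 0, y = 0;   // private variables of this work-item
    function1(&x, &y);  // call with arguments, not with a parameter list
}""").build()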

program = cl.Program(context, """
// Takes a pointer to x so that the assignment is visible to the caller.
void function1(int *x, int *y, __global const int *cc)
{
    *x = 10;
}
__kernel void kernel_name(__global const int *cc)
{
    int x = 1;
    int y[1] = {10};
    function1(&x, y, cc); // now x == 10
}""").build()
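A helper can only change a caller's variable if it receives a pointer to it; an int passed by value is copied, and this holds in OpenCL C just as in host C. The following is a minimal plain-C sketch of that difference, not PyOpenCL code. The names set_by_value, set_by_pointer and the two value_after_* wrappers are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Receives a copy of x; the assignment is invisible to the caller. */
void set_by_value(int x)
{
    x = 10;
    (void)x; /* silence "set but not used" warnings */
}

/* Receives the address of the caller's variable and writes through it. */
void set_by_pointer(int *x)
{
    assert(x != NULL);
    *x = 10;
}

/* Returns the caller's x after a by-value call. */
int value_after_by_value(void)
{
    int x = 1;
    set_by_value(x);
    return x; /* still 1 */
}

/* Returns the caller's x after a by-pointer call. */
int value_after_by_pointer(void)
{
    int x = 1;
    set_by_pointer(&x);
    return x; /* now 10 */
}
```

In a kernel the same rule applies to private variables: pass &x if the helper is supposed to update x.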

Related

OpenCL sum `cl_khr_fp64` double values into a single number

From this question and this question I managed to compile a minimal example of summing a vector into a single double inside OpenCL 1.2.
/* https://suhorukov.blogspot.com/2011/12/opencl-11-atomic-operations-on-floating.html */
inline void AtomicAdd(volatile __global double *source, const double operand) {
union { unsigned int intVal; double floatVal; } prevVal, newVal;
do {
prevVal.floatVal = *source;
newVal.floatVal = prevVal.floatVal + operand;
} while( atomic_cmpxchg((volatile __global unsigned int *)source, prevVal.intVal, newVal.intVal) != prevVal.intVal );
}
void kernel cost_function(__constant double* inputs, __global double* outputs){
int index = get_global_id(0);
if(0 == error_index){ outputs[0] = 0.0; }
barrier(CLK_GLOBAL_MEM_FENCE);
AtomicAdd(&outputs[0], inputs[index]); /* (1) */
//AtomicAdd(&outputs[0], 5.0); /* (2) */
}
However, this solution is incorrect: the result is always 0 when the buffer is read back. What might be the problem?
The code at /* (1) */ doesn't work, and neither does the code at /* (2) */, which is only there to test the logic independently of any inputs.
Is barrier(CLK_GLOBAL_MEM_FENCE); used correctly here to reset the output before any calculations are done on it?
According to the OpenCL 1.2 specs, single-precision floating-point numbers are supported by atomic operations. Is this (AtomicAdd) a feasible method of extending the support to double-precision numbers, or am I missing something?
Of course, the device I am testing with supports cl_khr_fp64.
Your AtomicAdd is incorrect. There are two errors:
In the union, intVal must be a 64-bit integer, not a 32-bit integer.
Use the 64-bit atom_cmpxchg function, not the 32-bit atomic_cmpxchg function.
A corrected implementation is:
#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable
inline void AtomicAdd(volatile __global double *source, const double operand) {
    // ulong is already the unsigned 64-bit type; "unsigned ulong" does not compile.
    union { ulong u64; double f64; } prevVal, newVal;
    do {
        prevVal.f64 = *source;
        newVal.f64 = prevVal.f64 + operand;
    } while(atom_cmpxchg((volatile __global ulong*)source, prevVal.u64, newVal.u64) != prevVal.u64);
}
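barrier(CLK_GLOBAL_MEM_FENCE); is placed validly here, since every work-item reaches it. Note that a barrier must not sit in an if- or else-branch that only some work-items of a work-group take. Be aware, though, that a barrier only synchronizes the work-items of one work-group, so with several work-groups the reset of outputs[0] is not guaranteed to finish before other groups start adding to it; clearing the buffer from the host before the launch avoids that.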
UPDATE: According to STREAMHPC, the original implementation you use is not guaranteed to produce correct results. There is an improved implementation:
void __attribute__((always_inline)) atomic_add_f(volatile global float* addr, const float val) {
union {
uint u32;
float f32;
} next, expected, current;
current.f32 = *addr;
do {
next.f32 = (expected.f32=current.f32)+val; // ...*val for atomic_mul_f()
current.u32 = atomic_cmpxchg((volatile global uint*)addr, expected.u32, next.u32);
} while(current.u32!=expected.u32);
}
#ifdef cl_khr_int64_base_atomics
#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable
void __attribute__((always_inline)) atomic_add_d(volatile global double* addr, const double val) {
union {
ulong u64;
double f64;
} next, expected, current;
current.f64 = *addr;
do {
next.f64 = (expected.f64=current.f64)+val; // ...*val for atomic_mul_d()
current.u64 = atom_cmpxchg((volatile global ulong*)addr, expected.u64, next.u64);
} while(current.u64!=expected.u64);
}
#endif
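The compare-and-swap loop used by these functions can be tried on the host as well. The following is a rough C11 analogue using stdatomic.h, not OpenCL code. The names double_to_bits, bits_to_double, atomic_add_double_bits and sum_with_atomic_add are hypothetical, and the sketch assumes a platform where double is 64 bits wide.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

_Static_assert(sizeof(double) == sizeof(uint64_t), "double must be 64 bits");

/* Reinterpret the 64 bits of a double as an integer. */
uint64_t double_to_bits(double d)
{
    uint64_t u;
    memcpy(&u, &d, sizeof u);
    return u;
}

/* Reinterpret a 64-bit integer as a double. */
double bits_to_double(uint64_t u)
{
    double d;
    memcpy(&d, &u, sizeof d);
    return d;
}

/* Atomically adds val to the double whose bits are stored in *addr. */
void atomic_add_double_bits(_Atomic uint64_t *addr, double val)
{
    assert(addr != NULL);
    uint64_t expected = atomic_load(addr);
    uint64_t desired;
    do {
        desired = double_to_bits(bits_to_double(expected) + val);
        /* On failure, expected is refreshed with the current contents. */
    } while (!atomic_compare_exchange_weak(addr, &expected, desired));
}

/* Sums n doubles through the atomic accumulator (single-threaded demo). */
double sum_with_atomic_add(const double *values, size_t n)
{
    _Atomic uint64_t acc;
    atomic_init(&acc, double_to_bits(0.0));
    for (size_t i = 0; i < n; ++i)
        atomic_add_double_bits(&acc, values[i]);
    return bits_to_double(atomic_load(&acc));
}
```

As in the OpenCL versions, the important detail is that the integer used for the comparison is as wide as the floating-point value; a 32-bit compare on a 64-bit double would only ever look at half of it.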

pyopencl - how to use generic types?

I work interchangeably with 32-bit floats and 32-bit integers. I want two kernels that do exactly the same thing, but one for integers and one for floats. At first I thought I could use templates or something similar, but it does not seem possible to define two kernels with the same name but different argument types.
import pyopencl as cl
import numpy as np
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
prg = cl.Program(ctx, """
__kernel void arange(__global int *res_g)
{
int gid = get_global_id(0);
res_g[gid] = gid;
}
__kernel void arange(__global float *res_g)
{
int gid = get_global_id(0);
res_g[gid] = gid;
}
""").build()
Error:
<kernel>:8:15: error: conflicting types for 'arange'
__kernel void arange(__global float *res_g)
^
<kernel>:2:15: note: previous definition is here
__kernel void arange(__global int *res_g)
What is the most convenient way of doing this?
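A preprocessor define can be used for that. Write the kernel once with a placeholder type and set it with a -D build option, building one program per type: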
code = """
__kernel void arange(__global TYPE *res_g)
{
int gid = get_global_id(0);
res_g[gid] = gid;
}
"""
prg_int = cl.Program(ctx, code).build("-DTYPE=int")
prg_float = cl.Program(ctx, code).build("-DTYPE=float")
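If both variants are wanted in a single translation unit, a function-generating macro is a related option. Below is a minimal host-C sketch of that idea. DEFINE_ARANGE, arange_int and arange_float are hypothetical names chosen for this illustration. OpenCL C uses a C99-style preprocessor, so a similar macro should be usable inside the kernel source too, but that is an assumption and is not tested here.

```c
#include <assert.h>
#include <stddef.h>

/* Expands to one arange function for the given element type. */
#define DEFINE_ARANGE(TYPE, NAME)              \
    void NAME(TYPE *res, size_t n)             \
    {                                          \
        assert(res != NULL || n == 0);         \
        for (size_t i = 0; i < n; ++i)         \
            res[i] = (TYPE)i;                  \
    }

/* Two functions generated from the same body, with distinct names. */
DEFINE_ARANGE(int, arange_int)
DEFINE_ARANGE(float, arange_float)
```

The trade-off against the -D approach is that each variant needs its own name, but only one build is required.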

Can the name of a running CUDA kernel be obtained by its threads?

Suppose some kernel (a __global__ function named foo) is running on a CUDA device. And suppose that kernel calls a __device__ function bar which is sometimes called from other kernels, i.e. the code of bar does not know at compile-time whether the kernel is foo or something else.
Can a thread running foo, within bar, obtain either the name "foo", the signature, or some other identifier of the kernel, preferably a human-readable one?
If necessary, assume the code has been compiled with any of --debug, --device-debug and/or --lineinfo.
The kernel can read the special register %gridid. %gridid is unique per launch. If the performance cost is acceptable, a simple kernel prolog can have one thread from each kernel launch output the mapping from gridid to global function name, using __func__ and %gridid. Alternatively, the CUPTI SDK Activity API can be used to collect this information. The CUpti_ActivityKernel2 event contains per-launch metadata, including the gridId and the CUfunction name.
Here is an example reading %gridid.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <stdint.h>
cudaError_t addWithCuda(int *c, const int *a, const int *b, unsigned int size);
static __device__ __inline__ uint64_t __gridid()
{
uint64_t gridid;
asm volatile("mov.u64 %0, %%gridid;" : "=l"(gridid));
return gridid;
}
__device__ void devPrintName()
{
static const char* name = __func__;
printf("%llu %s\n", __gridid(), name);
}
__global__ void globPrintName()
{
static const char* name = __func__;
printf("%llu %s\n", __gridid(), name);
devPrintName();
}
int main()
{
for (int i = 0; i < 4; ++i)
{
globPrintName<<<1,1,0>>>();
cudaDeviceReset();
}
return 0;
}
This sample outputs
1 globPrintName
1 devPrintName
2 globPrintName
2 devPrintName
3 globPrintName
3 devPrintName
4 globPrintName
4 devPrintName

Equivalent of memcpy in opencl

I'm new to OpenCL and this question might look silly.
I have a kernel which takes two structures, A and C. I want to copy the contents of structure A to structure C.
The structure looks like this:
struct Block {
bool used;
int size;
intptr_t data[1];
};
__kernel void function(__global struct Block *A, __global struct Block *C) {
//Do something on A
//COPY A to C by memcpy alternative
}
Is there any function like memcpy which I can use inside a kernel? I'm using OpenCL on an integrated GPU with zero copy.
Or do I have to copy block by block to structure C?
In your case you can simply assign the structures:
__kernel void function(__global struct Block *A, __global struct Block *C) {
//Do something on A
*C = *A;
}
It's the same as in plain C; many programmers don't know they can assign structures and resort to memcpy instead.
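The plain-C behaviour can be checked on the host with a small sketch. The struct is copied from the question; copy_block is a hypothetical helper name used only here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct Block {
    bool used;
    int size;
    intptr_t data[1];
};

/* Copies *src into *dst by plain assignment and returns dst. */
struct Block *copy_block(struct Block *dst, const struct Block *src)
{
    assert(dst != NULL && src != NULL);
    *dst = *src; /* copies every declared member, including data[0] */
    return dst;
}
```

One caveat: assignment copies exactly the declared members. If data[1] is really the head of a longer buffer allocated past the end of the struct, only the first element is copied this way and the remaining elements have to be copied separately, for example in a loop.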

Arduino helper function to add two arrays

I use an Arduino Uno with Arduino IDE 1.8.3. I have two arrays. I want to write a helper function that adds the two arrays and returns the result to the calling function, which prints it.
I want to use x (that is, sizeof(a)) for the array size, but it does not seem to be correct.
How do I solve this problem?
This is my code:
int a[]={1,2,3,4,5,6},b[]={1,1,1,1,1,1};
void setup() {
Serial.begin(9600);
int *p;
p = add(a,b);
for(int i=0;i<4;i++){
Serial.print(*(p+i));
}
}
void loop() {
}
int * add(int *a,int *b) {
int x = sizeof(a);
int y = sizeof(b);
static int z[4];
for(int i=0;i<4;i++) {
z[i]=a[i]+b[i];
}
return z;
}
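Inside add, a is an int*, which does not carry the size of the array; sizeof(a) there is the size of the pointer, not of the array.
The easiest fix is to pass the size as an extra parameter.
The next problem is that your static result array cannot change its size dynamically.
static has additional problems anyway, in general, so let the caller supply the result buffer: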
int* add(const int *a,const int *b, int* result, byte size) {
for(byte i=0; i<size; i++) {
result[i]=a[i]+b[i];
}
return result;
}
Returning the result as the return value may be convenient.
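The element count has to be computed where the array type is still visible, that is, at the call site. A minimal plain-C sketch of this follows. ARRAY_LEN, add_arrays and size_seen_in_callee are hypothetical names, and size_t is used here instead of the Arduino byte type so that the sketch compiles as standard C.

```c
#include <assert.h>
#include <stddef.h>

/* Element count of a real array; only valid where the array type is visible. */
#define ARRAY_LEN(arr) (sizeof(arr) / sizeof((arr)[0]))

/* Adds a and b element-wise into result; the caller supplies the length. */
int *add_arrays(const int *a, const int *b, int *result, size_t n)
{
    assert(a != NULL && b != NULL && result != NULL);
    for (size_t i = 0; i < n; ++i)
        result[i] = a[i] + b[i];
    return result;
}

/* Size seen through a pointer parameter: always sizeof(int *). */
size_t size_seen_in_callee(const int *a)
{
    return sizeof(a);
}
```

A caller would declare int z[ARRAY_LEN(a)]; and call add_arrays(a, b, z, ARRAY_LEN(a));. On an 8-bit AVR board such as the Uno, sizeof(a) inside the callee is typically 2, which is why it cannot serve as the element count.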
