OpenCL scalar vs vector - opencl

I have simple kernel:
__kernel vecadd(__global const float *A,
__global const float *B,
__global float *C)
{
int idx = get_global_id(0);
C[idx] = A[idx] + B[idx];
}
Why when I change float to float4, kernel runs more than 30% slower?
All tutorials says, that using vector types speeds up computation...
On host side, memory alocated for float4 arguments is 16 bytes aligned and global_work_size for clEnqueueNDRangeKernel is 4 times smaller.
Kernel runs on AMD HD5770 GPU, AMD-APP-SDK-v2.6.
Device info for CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT returns 4.
EDIT:
global_work_size = 1024*1024 (and greater)
local_work_size = 256
Time measured using CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END.
For smaller global_work_size (8196 for float / 2048 for float4), vectorized version is faster, but I would like to know, why?

I don't know what are the tutorials you refer to, but they must be old.
Both ATI and NVIDIA use scalar gpu architectures for at least half-decade now.
Nowdays using vectors in your code is only for syntactical convenience, it bears no performance benefit over plain scalar code.
It turns out scalar architecture is better for GPUs than vectored - it is better at utilizing the hardware resources.

I am not sure why the vectors would be that much slower for you, without knowing more about workgroup and global size. I would expect it to at least the same performance.
If it is suitable for your kernel, can you start with C having the values in A? This would cut down memory access by 33%. Maybe this applies to your situation?
__kernel vecadd(__global const float4 *B,
__global float4 *C)
{
int idx = get_global_id(0);
C[idx] += B[idx];
}
Also, have you tired reading in the values to a private vector, then adding? Or maybe both strategies.
__kernel vecadd(__global const float4 *A,
__global const float4 *B,
__global float4 *C)
{
int idx = get_global_id(0);
float4 tmp = A[idx] + B[idx];
C[idx] = tmp;
}

Related

Static variable in OpenCL C

I'm writing a renderer from scratch using openCL and I have a little compilation problem on my kernel with the error :
CL_BUILD_PROGRAM : error: program scope variable must reside in constant address space static float* objects;
The problem is that this program compiles on my desktop (with nvidia drivers) and doesn't work on my laptop (with nvidia drivers), also I have the exact same kernel file in another project that works fine on both computers...
Does anyone have an idea what I could be doing wrong ?
As a clarification, I'm coding a raymarcher which's kernel takes a list of objects "encoded" in a float array that is needed a lot in the program and that's why I need it accessible to the hole kernel.
Here is the kernel code simplified :
float* objects;
float4 getDistCol(float3 position) {
int arr_length = objects[0];
float4 distCol = {INFINITY, 0, 0, 0};
int index = 1;
while (index < arr_length) {
float objType = objects[index];
if (compare(objType, SPHERE)) {
// Treats the part of the buffer as a sphere
index += SPHERE_ATR_LENGTH;
} else if (compare(objType, PLANE)) {
//Treats the part of the buffer as a plane
index += PLANE_ATR_LENGTH;
} else {
float4 errCol = {500, 1, 0, 0};
return errCol;
}
}
}
__kernel void mkernel(__global int *image, __constant int *dimension,
__constant float *position, __constant float *aimDir, __global float *objs) {
objects = objs;
// Gets ray direction and stuf
// ...
// ...
float4 distCol = RayMarch(ro, rd);
float3 impact = rd*distCol.x + ro;
col = distCol.yzw * GetLight(impact);
image[dimension[0]*dimension[1] - idx*dimension[1]+idy] = toInt(col);
Where getDistCol(float3 position) gets called a lot by a lot of functions and I would like to avoid having to pass my float buffer to every function that needs to call getDistCol()...
There is no "static" variables allowed in OpenCL C that you can declare outside of kernels and use across kernels. Some compilers might still tolerate this, others might not. Nvidia has recently changed their OpenCL compiler from LLVM 3.4 to NVVM 7 in a driver update, so you may have the 2 different compilers on your desktop/laptop GPUs.
In your case, the solution is to hand the global kernel parameter pointer over to the function:
float4 getDistCol(float3 position, __global float *objects) {
int arr_length = objects[0]; // access objects normally, as you would in the kernel
// ...
}
kernel void mkernel(__global int *image, __constant int *dimension, __constant float *position, __constant float *aimDir, __global float *objs) {
// ...
getDistCol(position, objs); // hand global objs pointer over to function
// ...
}
Lonely variables out in the wild are only allowed as constant memory space, which is useful for large tables. They are cached in L2$, so read-only access is potentially faster. Example
constant float objects[1234] = {
1.0f, 2.0f, ...
};

OpenCL says CL_KERNEL_WORK_GROUP_SIZE is 256 - but accepts 1024; why?

On NVIDIA GPUs from recent years, one can run kernels with up to 1024 "threads" per block, or work-items per workgroup in OpenCL parlance - as long as the kernel uses less than 64 registers. Specifically, that holds for a simple kernel such as:
__kernel void vectorAdd(
__global unsigned char * __restrict C,
__global unsigned char const * __restrict A,
__global unsigned char const * __restrict B,
unsigned long length)
{
int i = get_global_id(0);
if (i < length)
C[i] = A[i] + B[i] + 2;
}
however, if we build this kernel and run:
built_kernel.getWorkGroupInfo<CL_KERNEL_WORK_GROUP_SIZE>(device);
this yields 256, not 1024 (on a Quadro RTX 6000 with CUDA 11.6.55 for example).
That seems like a bug with the API. Is it? Or - is there some legitimate reason why CL_KERNEL_WORK_GROUP_SIZE should be 256?

speedup when using float4, opencl

I have the following opencl kernel function to get the column sum of a image.
__kernel void columnSum(__global float* src,__global float* dst,int srcCols,
int srcRows,int srcStep,int dstStep)
{
const int x = get_global_id(0);
srcStep >>= 2;
dstStep >>= 2;
if (x < srcCols)
{
int srcIdx = x ;
int dstIdx = x ;
float sum = 0;
for (int y = 0; y < srcRows; ++y)
{
sum += src[srcIdx];
dst[dstIdx] = sum;
srcIdx += srcStep;
dstIdx += dstStep;
}
}
}
I assign that each thread process a column here so that a lot of threads can get the column_sum of each column in parallel.
I also use float4 to rewrite the above kernel so that each thread can read 4 elements in a row at one time from the source image, which is shown below.
__kernel void columnSum(__global float* src,__global float* dst,int srcCols,
int srcRows,int srcStep,int dstStep)
{
const int x = get_global_id(0);
srcStep >>= 2;
dstStep >>= 2;
if (x < srcCols/4)
{
int srcIdx = x ;
int dstIdx = x ;
float4 sum = (float4)(0.0f, 0.0f, 0.0f, 0.0f);
for (int y = 0; y < srcRows; ++y)
{
float4 temp2;
temp2 = vload4(0, &src[4 * srcIdx]);
sum = sum + temp2;
vstore4(sum, 0, &dst[4 * dstIdx]);
srcIdx += (srcStep/4);
dstIdx += (dstStep/4);
}
}
}
In this case, theoretically, I think the time consumed by the second kernel to process a image should be 1/4 of the time consumed by the first kernel function. However, no matter how large the image is, the two kernels almost consume the same time. I don't know why. Can you guys give me some ideas? T
OpenCL vector data types like float4 were fitting better the older GPU architectures, especially AMD's GPUs. Modern GPUs don't have SIMD registers available for individual work-items, they are scalar in that respect. CL_DEVICE_PREFERRED_VECTOR_WIDTH_* equals 1 for OpenCL driver on NVIDIA Kepler GPU and Intel HD integrated graphics. So adding float4 vectors on modern GPU should require 4 operations. On the other hand, OpenCL driver on Intel Core CPU has CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT equal to 4, so these vectors could be added in a single step.
You are directly reading the values from "src" array (global memory). Which typically is 400 times slower than private memory. Your bottleneck is definitelly the memory access, not the "add" operation itself.
When you move from float to float4, the vector operation (add/multiply/...) is more efficient thanks to the ability of the GPU to operate with vectors. However, the read/write to global memory remains the same.
And since that is the main bottleneck, you will not see any speedup at all.
If you want to speed your algorithm, you should move to local memory. However you have to manually resolve the memory management, and the proper block size.
which architecture do you use?
Using float4 has higher instruction level parallelism (and then require 4 times less threads) so theoretically should be faster (see http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf)
However did i understand correctly in you kernel you are doing prefix-sum (you store the partial sum at every iteration of y)? If so, because of the stores the bottleneck is at the memory writes.
I think on the GPU float4 is not a SIMD operation in OpenCL. In other words if you add two float4 values the sum is done in four steps rather than all at once. Floatn is really designed for the CPU. On the GPU floatn serves only as a convenient syntax, at least on Nvidia cards. Each thread on the GPU acts as if it is scalar processor without SIMD. But the threads in a warp are not independent like they are on the CPU. The right way to think of the GPGPU models is Single Instruction Multiple Threads (SIMT).
http://www.yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html
Have you tried running your code on the CPU? I think the code with float4 should run quicker (potentially four times quicker) than the scalar code on the CPU. Also if you have a CPU with AVX then you should try float8. If the float4 code is faster on the CPU than float8 should be even faster on a CPU with AVX.
try to define __ attribute __ to kernel and see changes in run timing
for example try to define:
__ kernel void __ attribute__((vec_type_hint(int)))
or
__ kernel void __ attribute__((vec_type_hint(int4)))
or some floatN as you want
read more:
https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/functionQualifiers.html

OpenCL kernel work-group size restriction

So I keep running into strange errors when I call my kernels; the stated max kernel work-group size is one, while the work group size of my device (my Macbook) is decidedly higher than that. What possible causes could there be for the kernels restricting the code to a single work group? Here's one of my kernels:
__kernel
void termination_kernel(const int Elements,
__global float* c_I,
__global float* c_Ihat,
__global float* c_rI,
__local float* s_a)
{
const int bdim = 128;
int n = get_global_id(0);
const int tx = get_local_id(0); // thread index in thread-block (0-indexed)
const int bx = get_group_id(0); // block index (0-indexed)
const int gx = get_num_groups(0);
// is thread in range for the addition
float d = 0.f;
while(n < Elements){
d += pow(c_I[n] - c_Ihat[n], 2);
n += gx * bdim;
}
// assume bx power of 2
int alive = bdim / 2;
s_a[tx] = d;
barrier(CLK_LOCAL_MEM_FENCE);
while(alive > 1){
if(tx < alive)
s_a[tx] += s_a[tx + alive];
alive /= 2;
barrier(CLK_LOCAL_MEM_FENCE);
}
if(tx == 0)
c_rI[bx] = s_a[0] + s_a[1];
}
and the error returned is
OpenCL Error (via pfn_notify): [CL_INVALID_WORK_GROUP_SIZE] : OpenCL Error : clEnqueueNDRangeKernel
failed: total work group size (128) is greater than the device can support (1)
OpenCL Error: 'clEnqueueNDRangeKernel(queue, kernel_N, dim, NULL, global_N, local_N, 0, NULL, NULL)'
I know it says the restriction is on the device, but debugging shows that
CL_DEVICE_MAX_WORK_GROUP_SIZE = 1024
and
CL_KERNEL_WORK_GROUP_SIZE = 1
The kernel construction is called by
char *KernelSource_T = readSource("Includes/termination_kernel.cl");
cl_program program_T = clCreateProgramWithSource(context, 1, (const char **) &KernelSource_T, NULL, &err);
clBuildProgram(program_T, 1, &device, flags, NULL, NULL);
cl_kernel kernel_T = clCreateKernel(program_T, "termination_kernel", &err);
I'd include the calling function, but I'm not sure if it's relevant; my intuition is that it's something in the kernel code that's forcing the restriction. Any ideas? Thanks in advance for the help!
Apple OpenCL doesn't support work-groups larger than [1, 1, 1] on the CPU. I have no idea why, but that's how it's been at least up to OSX 10.9.2. Larger work-groups are fine on the GPU, though.
CL_KERNEL_WORK_GROUP_SIZE tells you how large the maximum work group size can be for this particular kernel. OpenCL's runtime determines that by inspecting the kernel code. CL_KERNEL_WORK_GROUP_SIZE will be a number less or equal to CL_DEVICE_MAX_WORK_GROUP_SIZE.
Hope the amount of local memory avilable is less for that work group size . Please can you show the arguments? . You can try by reducing the work group size , start with 2,4,8,16,32,64,128 so on make sure its power of 2.
Time has passed since the answer of Tomi and it seems that Apple has become slightly more flexible on this aspect. On my OS X 10.12.3 (still OpenCL 1.2), it is possible to use up to CL_DEVICE_MAX_WORK_GROUP_SIZE in the first dimension.
According to the specification, it is also possible to get the maximum number of work-groups for each dimension through CL_DEVICE_MAX_WORK_ITEM_SIZES according to the documentation

OpenCL modulo of large numbers

I'm trying to calculate a mod b in OpenCL, where a is an array of ulong elements, and is twice the length of b.
__kernel void mod(__global ulong *a, __global ulong *b, __global ulong length) {
// length = len(a) = 2 * len(b)
...
}
What I want is something like a %= b, but with arrays. The arrays represent numbers of course, with their last element representing the least significant bits.
Is it possible to do this in-place (i.e. without allocating extra memory)? What is a good algorithm for calculating the medulus for large numbers?
Note that neither of the two numbers can be easily represented in another way (e.g. using exponents). Most of the times they will be pseudoprimes. Also, having some concurrency would be nice.
Pointers to any useful material on this are welcome.
EDIT: if that helps, length can be known at compile time.
EDIT: I'm sorry I wasn't clear here. I'm not working on an array of integers, I'm working on two big integers, for example a is 8Mb (a 67108864-bit number) and b is 4Mb (a 33554432-bit number). I work them in base 2^64, hence the arrays of ulong integers. Basically, those are just the digits of the number.
You just do:
__kernel void mod(__global ulong *a, __global ulong *b, __global ulong length) {
ulong id = get_global_id(0) ;
a[id] = a[id] % b[id];
}
I don't really understand your problem, the arrays size difers? Or maybe you want a more special calculation?

Resources