OpenCL: Copying from __constant to __global memory

I have two kernels (both of them run only once, so the globalWorkSize is 1 in the example):
The first kernel (kernel_Calc()) calculates some values and stores them in __global memory. In this example it sets up a transformation matrix which translates a point in 3D space, and uses it to transform the origin.
inline float4 mul( const float4 M[ 4 ], const float4 v )
{
    float4 r;
    r.x = dot( v, M[ 0 ] );
    r.y = dot( v, M[ 1 ] );
    r.z = dot( v, M[ 2 ] );
    r.w = dot( v, M[ 3 ] );
    return r;
}

__kernel
void kernel_Calc( __global float4* g_TransformationMatrices, __global float3* g_Point3D )
{
    __private float4 transformationMatrix[ 4 ];
    transformationMatrix[ 0 ] = (float4) ( 1.0f, 0.0f, 0.0f,  0.0f );
    transformationMatrix[ 1 ] = (float4) ( 0.0f, 1.0f, 0.0f, 10.0f );
    transformationMatrix[ 2 ] = (float4) ( 0.0f, 0.0f, 1.0f,  0.0f );
    transformationMatrix[ 3 ] = (float4) ( 0.0f, 0.0f, 0.0f,  1.0f );

    g_TransformationMatrices[ 0 ] = transformationMatrix[ 0 ];
    g_TransformationMatrices[ 1 ] = transformationMatrix[ 1 ];
    g_TransformationMatrices[ 2 ] = transformationMatrix[ 2 ];
    g_TransformationMatrices[ 3 ] = transformationMatrix[ 3 ];

    float4 point4D = (float4) ( 0.0f, 0.0f, 0.0f, 1.0f );
    float4 point4DTransformed = mul( transformationMatrix, point4D );
    g_Point3D[ 0 ] = (float3) ( point4DTransformed.x / point4DTransformed.w,
                                point4DTransformed.y / point4DTransformed.w,
                                point4DTransformed.z / point4DTransformed.w );
}
On the host side I copy the calculated __global buffers into __constant buffers (CL_MEM_READ_ONLY buffers) with the clEnqueueCopyBuffer() function. (I do this because I hope reading from __constant memory will be faster than reading from __global memory. The buffer copy can be done on the device side with this function, without copying the __global buffer back to the host and then copying it into the __constant buffer.)
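For illustration, a minimal host-side sketch of this device-side copy (the buffer, queue and error-variable names are placeholders, not the ones from my real program):
/* 'context', 'queue' and 'globalMatrices' are assumed to exist already. */
cl_int err;
cl_mem constMatrices = clCreateBuffer( context, CL_MEM_READ_ONLY,
                                       4 * sizeof(cl_float4), NULL, &err );
/* device-side copy: no round trip through host memory */
err = clEnqueueCopyBuffer( queue, globalMatrices, constMatrices,
                           0, 0, 4 * sizeof(cl_float4), 0, NULL, NULL );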
The second kernel (kernel_Test()) tries to load the calculated values into a __global variable (__global float4* test) which can be read on the host side. sizeStruct is a user-defined struct which contains only an integer (the number of matrices and transformed points). The second and third parameters are the buffers in __constant memory which were filled with the clEnqueueCopyBuffer() function.
struct sizeStruct
{
    int m_Size;
};

__kernel
void kernel_Test( __constant struct sizeStruct* c_SS,
                  __constant float4* c_TransformationMatrices,
                  __constant float3* c_Points3D,
                  __global float4* test )
{
    test[ 0 ] = c_TransformationMatrices[ 0 ];
    test[ 1 ] = c_TransformationMatrices[ 1 ];
    test[ 2 ] = c_TransformationMatrices[ 2 ];
    test[ 3 ] = c_TransformationMatrices[ 3 ];
}
The problem is that when I run the program, the test buffer contains this:
1.000000, 0.000000, 0.000000, 0.000000
0.000000, 0.000000, 0.000000, 0.000000
0.000000, 0.000000, 0.000000, 0.000000
0.000000, 0.000000, 0.000000, 0.000000
but it should contain:
1.000000, 0.000000, 0.000000, 0.000000
0.000000, 1.000000, 0.000000, 10.000000
0.000000, 0.000000, 1.000000, 0.000000
0.000000, 0.000000, 0.000000, 1.000000
I checked the __constant buffers (by copying them to host memory) and they contain the correct data. The code is a simplified version of my program, which is why it may contain unnecessary operations and parameters. The example is tested and behaves exactly as described.
More interestingly, when I change the __constant float3* c_Points3D kernel parameter to __global float3* c_Points3D (but still use the read-only buffer that was filled with clEnqueueCopyBuffer()), it works fine. It also works when I remove the __constant struct sizeStruct* c_SS parameter.
So it seems something is wrong with the address spaces of the arguments of kernel_Test, but the problem only appears at the __constant -> __global copy.
I'm running the program on an NVIDIA GeForce GTX 690, but I can change the device (and the platform) to an Intel i7-3930K (using the Intel SDK). Using the Intel i7 CPU, everything works fine without any change to the kernel code.
Q1: Why does this weird behaviour appear? Does anybody have an idea what I am doing wrong?
Q2: Is it legal to create a buffer with CL_MEM_READ_ONLY and use it with the __global address space qualifier?

Q1: Why does this weird behaviour appear? Does anybody have an idea what I am doing wrong?
Nothing obvious comes to mind; can you post a complete minimal working example that compiles on Windows? I'd like to test it on AMD Radeon GPUs.
Q2: Is it legal to create a buffer with CL_MEM_READ_ONLY and use it with the __global address space qualifier?
Yes, that is legal, although you will want to add const to specify that the buffer is read-only; see section 6.5.1 of the OpenCL 1.2 specification.
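As a minimal sketch (assuming the same kernel_Test as above, everything else unchanged), the const-qualified signature could look like this:
__kernel
void kernel_Test( __constant struct sizeStruct* c_SS,
                  __global const float4* c_TransformationMatrices, /* read-only buffer kept in the __global address space */
                  __global const float3* c_Points3D,
                  __global float4* test )
{
    /* body identical to kernel_Test above */
}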

Related

OpenCL: Using vload vs. vector pointer

Is there any advantage to using vload vs. directly casting to a vector pointer? Which would be faster on mobile GPUs with less computational power and bandwidth?
Ex:
vload sample
__kernel void vec_add(__global const float* a, __global const float* b, __global float* c){
    float4 a_sub;
    float4 b_sub;
    float4 c_sub;
    a_sub = vload4(0, &a[get_global_id(0)]);
    b_sub = vload4(0, &b[get_global_id(0)]);
    c_sub = a_sub + b_sub;
    vstore4(c_sub, 0, &c[get_global_id(0)]);
}
vector pointer sample
__kernel void vec_add(__global const float* a, __global const float* b, __global float* c){
    float4 a_sub;
    float4 b_sub;
    float4 c_sub;
    a_sub = ((global const float4*)a)[get_global_id(0)];
    b_sub = ((global const float4*)b)[get_global_id(0)];
    c_sub = a_sub + b_sub;
    vstore4(c_sub, 0, &c[get_global_id(0)]);
}
As stated in the comments, which way of loading the data is fastest depends on the target hardware. In other words, you should benchmark whether there is any difference at all. However, I can't remember ever achieving any speed-up just by changing the syntax.
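If it helps, one way to measure this on the device is OpenCL event profiling; here is a rough host-side sketch (the queue, kernel and size names are placeholders):
/* the queue must have been created with CL_QUEUE_PROFILING_ENABLE */
cl_event evt;
cl_ulong t_start, t_end;
clEnqueueNDRangeKernel( queue, vec_add_kernel, 1, NULL, &global_size, NULL, 0, NULL, &evt );
clWaitForEvents( 1, &evt );
clGetEventProfilingInfo( evt, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &t_start, NULL );
clGetEventProfilingInfo( evt, CL_PROFILING_COMMAND_END,   sizeof(cl_ulong), &t_end,   NULL );
/* (t_end - t_start) is the kernel execution time in nanoseconds */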
If you need to work with float* buffers, then a third option to try is to write the same load like this:
a_sub = (float4)(
    a[get_global_id(0) * 4 + 0],
    a[get_global_id(0) * 4 + 1],
    a[get_global_id(0) * 4 + 2],
    a[get_global_id(0) * 4 + 3]
);
However, many times there is no reason to work with float* buffers at all and you can use float4* buffers instead. In that case the compiler knows for certain that the load is going to be aligned. Moreover, I have seen significant speed-ups on mobile platforms just by changing the buffer type. So in your case the kernel signature would look like:
__kernel vec_add(__global const float4* a, __global const float4* b, __global float4* c){
and the load would be:
a_sub = a[get_global_id(0)];
Note that the vloadn and vstoren built-ins only require the pointer to be aligned to the element type (here float), while the "traditional C syntax" pointer cast is only defined when the address is aligned to the full vector size. The cast therefore gives the compiler a stronger alignment guarantee and more room for optimization. Of course it will be up to your specific implementation, so profiling is needed.
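As a small illustration (not from the original answers) of that alignment difference, loading four floats starting at an arbitrary element offset i:
/* vload4 only needs float alignment, so any element offset i is valid: */
float4 v = vload4( 0, &a[ i ] );
/* the cast below is only well defined when &a[i] is 16-byte aligned,
   i.e. when i is a multiple of 4: */
float4 w = ( (__global const float4*)( a + i ) )[ 0 ];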

OpenGL and Qt 5.5: glm::perspective doesn't work

I have made the following perspective matrix to isolate the problem to the glm::perspective function:
QMatrix4x4 proj (1.f, 0.f, 0.f, 0.f,
0.f, 1.f, 0.f, 0.f,
0.f, 0.f, 1.f, 0.0f,
0.f, 0.f, 1.1f, 1.f);
This works and produces an image. However, when trying to use glm to construct the perspective matrix like so:
glm::mat4 proj;
proj = glm::perspective(
glm::radians(80.0f),
1.0f,
0.0f,
2.0f
);
Nothing comes up.
I was under the impression that when putting 0.0f and 2.0f into the near-plane and far-plane arguments, any vertex coordinate in the range 0.0f to 2.0f would be linearly mapped into the -1.0f to 1.0f range to be used as normalized device coordinates. However, no matter which pair of values I put here, nothing is rendered.
Here are the coordinates I'm trying to draw:
rawverts = {
0.0f, 0.0f, 1.0f,
0.0f, 0.7f, 1.0f,
0.4f, 0.0f, 1.0f,
0.0f, -0.7f, 1.0f,
-0.4f, 0.0f, 1.0f
};
and when passing the projection matrix to the vertex shader:
int projIndex = shaders->uniformLocation("proj");
...
shaders->setUniformValue(projIndex, QMatrix4x4(glm::value_ptr(proj)) );
The vertex shader itself:
#version 330 core
in vec3 vertex;
uniform mat4 translate;
uniform mat4 view;
uniform mat4 proj;
uniform float time;
uniform float aspect;
uniform vec2 resolution;
void main() {
gl_Position = proj * view * translate * vec4(vertex, 1);
}
You MUST pass the transposed QMatrix4x4 of the glm matrix: QMatrix4x4's float* constructor expects row-major data, while glm stores its matrices in column-major order.
So instead of using
shaders->setUniformValue(projIndex, QMatrix4x4(glm::value_ptr(proj)) );
you must use:
shaders->setUniformValue(projIndex, QMatrix4x4(glm::value_ptr(proj)).transposed());
Every time you combine QMatrix4x4 and glm::mat4 you must send the transposed matrix; then it should work.
JC
PS: for the record, I also struggled two days on this :(

How to handle the case of buffers which are out of range in OpenCL?

OpenCL kernel code:
__kernel void calc(__global double* arr1, __global double* arr2, __global double* arr3)
{
    int g = get_global_id(0);
    if (g >= 6400) {
        return;
    }
    arr1[g] = arr3[g] * 10.0;
    arr2[g] = arr3[g] * 20.0;
}
Global work size: 6400
This kernel computes arr1 and arr2 from arr3. The issue is that arr2 only has a size of 3200, so writing arr2[g] for g >= 3200 is out of range and this will not work. How can I handle the case of buffers which are out of range in OpenCL?
I can think of using a conditional statement, but I do not want any if-else statement, like:
__kernel void calc(__global double* arr1, __global double* arr2, __global double* arr3)
{
    int g = get_global_id(0);
    if (g >= 6400) {
        return;
    }
    arr1[g] = arr3[g] * 10.0;
    if (g < 3200) {
        arr2[g] = arr3[g] * 20.0;
    }
}
I also do not want to add any dummy or extra buffer space to arr2, which would increase my buffer size.
How can I handle such scenarios in OpenCL? Any help/suggestions/links will be appreciated.

OpenCL pass by reference with different address spaces

Short story:
I have a function with pass-by-reference output variables
void acum( float2 dW, float4 dFE, float2 *W, float4 *FE )
which is supposed to increment the variables *W and *FE by dW and dFE if some conditions are fulfilled.
I want to make this function general - the output variables can be either local or global:
acum( dW, dFE, &W , &FE ); // __local acum
acum( W, FE, &Wout[idOut], &FEout[idOut] ); // __global acum
When I try to compile it I get the error
error: illegal implicit conversion between two pointers with different address spaces
Is it possible to make this work somehow? I was wondering whether I could use a macro instead of a function (but I'm not very familiar with macros in C).
Another possibility would probably be:
to use a function which returns a struct{ Wnew, FEnew },
or a (float8)(Wnew, FEnew, 0, 0), but I don't like that because it makes the code more confusing.
Naturally I also don't want to just copy the body of "acum" all over the place (i.e. manually inline it :) ).
Background (not necessary to read):
I'm trying to program some thermodynamic sampling using OpenCL. Because the statistical weight W = exp(-E/kT) can easily overflow a single-precision float at low temperature, I made a composed datatype W = float2(W_digits, W_exponent) and defined functions such as "acum" to manipulate these numbers.
I try to minimize the number of global memory operations, so I let the work items iterate over Vsurf rather than FEout, because I expect that only a few points in Vsurf will have considerable statistical weight, so the accumulation into FEout would be called only a few times for each work item, while iterating over FEout would require on the order of sizeof(FEout)*sizeof(Vsurf) memory operations.
The whole code is here (any recommendations on how to make it more efficient are welcome):
// ===== function :: FF_vdW - Lennard-Jones / van der Waals force field
float4 FF_vdW ( float3 R ){
const float C6 = 1.0;
const float C12 = 1.0;
float ir2 = 1.0/ dot( R, R );
float ir6 = ir2*ir2*ir2;
float ir12 = ir6*ir6;
float E6 = C6*ir6;
float E12 = C12*ir12;
return (float4)(
( 6*E6 - 12*E12 ) * ir2 * R
, E12 - E6
);}
// ===== function :: FF_spring - harmonic forcefield
float4 FF_spring( float3 R){
const float3 k = (float3)( 1.0, 1.0, 1.0 );
float3 F = k*R;
return (float4)( F,
0.5*dot(F,R)
);}
// ===== function :: EtoW - compute statistical weight
float2 EtoW( float EkT ){
float Wexp = floor( EkT);
return (float2)( exp(EkT - Wexp)
, Wexp
); }
// ===== procedure : addExpInplace -- accumulate F,E with statistical weight dW
void acum( float2 dW, float4 dFE, float2 *W, float4 *FE )
{
float dExp = dW.y - (*W).y; // log(dW)-log(W)
if(dExp>-22){ // e^22 = 2^32 , single_float = 2^+64
float fac = exp(dExp);
if (dExp<0){ // log(dW)<log(W)
dW.x *= fac;
(*FE) += dFE*dW.x;
(*W ).x += dW.x;
}else{ // log(dW)>log(W)
(*FE) = dFE + fac*(*FE);
(*W ).x = dW.x + fac*(*W).x;
(*W ).y = dW.y;
}
}
}
// ===== __kernel :: sampler
__kernel void sampler(
__global float * Vsurf, // in : surface potential (including vdW) // may be faster to precomputed endpoints positions like float8
__global float4 * FEout, // out : Fx,Fy,Fy, E
__global float2 * Wout, // out : W_digits, W_exponent
int3 nV ,
float3 dV ,
int3 nOut ,
int3 iOut0 , // shift of Fout in respect to Vsurf
int3 nCopy , // number of copies of
int3 nSample , // dimension of sampling in each dimension around R0 +nSample,-nSample
float3 RXe0 , // position of Xe relative to Tip
float EcutSurf ,
float EcutTip ,
float logWcut , // accumulate only when log(W) > logWcut
float kT // maximal energy which should be sampled
) {
int id = get_global_id(0); // loop over potential grid points
int idx = id/nV.x;
int3 iV = (int3)( idx/nV.y
, idx%nV.y
, id %nV.x );
float V = Vsurf[id];
float3 RXe = dV*iV;
if (V<EcutSurf){
// loop over tip position
for (int iz=0;iz<nOut.z;iz++ ){
for (int iy=0;iy<nOut.y;iy++ ){
for (int ix=0;ix<nOut.x;ix++ ){
int3 iTip = (int3)( iz, iy, ix );
float3 Rtip = dV*iTip;
float4 FE = 0;
float2 W = 0;
// loop over images of potential
for (int ix=0;ix<nCopy.x;ix++ ){
for (int iy=0;iy<nCopy.y;iy++ ){
float3 dR = RXe - Rtip;
float4 dFE = FF_vdW( dR );
dFE += FF_spring( dR - RXe0 );
dFE.w += V;
if( dFE.w < EcutTip ){
float2 dW = EtoW( - FE.w / kT );
acum( dW, dFE, &W , &FE ); // __local acum
}
}
}
if( W.y > logWcut ){ // accumulate force
int idOut = iOut0.x + iOut0.y*nOut.x + iOut0.z*nOut.x*nOut.y;
acum( W, FE, &Wout[idOut], &FEout[idOut] ); // __global acum
}
}}}}
}
I'm using pyOpenCL on ubuntu 12.04 64bit but I think it has nothink to do with the problem
OK, here what's happening, from the OpenCL man pages:
http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/global.html
"If the type of an object is qualified by an address space name, the object is allocated in the specified address name; otherwise, the object is allocated in the generic address space"
...
"The generic address space name for arguments to a function in a program, or local variables of a function is __private. All function arguments shall be in the __private address space. "
So your acum( ... ) function arguments are in the __private address space.
So the compiler is correctly saying that
acum( .., &Wout[idOut], &FEout[idOut] )
is being called with &Wout[idOut] and &FEout[idOut] in the __global address space, when the function arguments must be in the __private address space.
The solution is a conversion between global and private:
1. create two private temporary variables to receive the results,
2. call acum( ... ) with these variables,
3. assign the temporary private values to the global values after you've called acum( ... ).
The code will look a bit messy.
Remember that on a GPU you have many address spaces; you can't magically jump between them by casting. You have to move data explicitly between address spaces by assignment.
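A minimal sketch of that workaround, reusing the names from your kernel (an illustration of the idea, not tested code):
float2 Wtmp  = Wout[idOut];     // __global -> __private copy of the current value
float4 FEtmp = FEout[idOut];
acum( W, FE, &Wtmp, &FEtmp );   // acum now only sees __private pointers
Wout[idOut]  = Wtmp;            // write the accumulated result back to __global
FEout[idOut] = FEtmp;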

Can this OpenCL code be optimized?

I am working on a piece of OpenCL code for a specialized matrix function: for a Dx1 vector v, two DxD matrices A and B, and a constant c, return the 1xD vector r where r[i] = c * sum_over_j( v[j] * A[i][j] * B[i][j] ).
Below is what I have so far, but it runs freakishly slowly. A version without the summing that returns a DxD matrix is about ten times faster. It is called from PyOpenCL, if that makes any difference.
Am I doing anything wrong? Could it be optimized?
#define D 1000
...
__kernel void element_mult(
__global float *result,
__global const float *vector,
__global const float *matrix,
__global const float *matrix2,
const float factor)
{
int y = get_global_id(1);
float sum = 0;
for(int k = 0; k < D; k++)
{
sum += vector[k] * matrix[(y*D) + k]
* matrix2[(y*D) + k ];
}
result[y] = sum * factor;
}
Cheers!
Optimization #1: make vector __local.
My first pass at this got a decent improvement in performance. I noticed that each vector[k] is read a total of D times, so I copied it to __local memory. This is only possible because D is small enough to allow it. The kernel as you have it above suffers from a terrible ALU:fetch ratio of 0.08 on both the 5870 and 6970 GPUs. Even the slower GPUs are still waiting on the memory accesses.
#define D 1000
__kernel void element_mult(
__global float *result,
__global const float *vector,
__global const float *matrix,
__global const float *matrix2,
const float factor)
{
int y = get_global_id(0);
float sum = 0;
__local float vectCopy[D];
int ls = get_local_size(0);
int lid = get_local_id(0);
for(int i=0;i<D;i+=ls){
vectCopy[i+lid] = vector[i+lid];
}
barrier(CLK_LOCAL_MEM_FENCE); // all work-items must finish filling vectCopy before any of them reads it
for(int k = 0; k < D; k++)
{
sum += vectCopy[k] * matrix[(y*D) + k] * matrix2[(y*D) + k ];
}
result[y] = sum * factor;
}
With this change, the APP profiler shows a new ALU:fetch ratio of 0.20 for the 5870 and 6970 GPUs. Average times changed from 1513 to 1034, and from 1261 to 861 on the same cards. The low-end GPUs are now bound by ALU instead of fetch (greater than a 4:1 ratio).
Optimization #2: calculate each result[y] using an entire work group.
You would have to do this if D were much larger (100k+). The idea is to get the best memory access pattern by using the work group to compute a single element of the result at a time. I defined ls (local size) to be 64 here, because it works on my hardware as well as on most vendors'. The workgroup size you use from the host side will have to be 64 unless you change that definition. It needs to be a compile-time definition to create the sum[ls] storage as __local, and I don't like passing variable-sized __local variables into my kernels.
Results: 5870 ALU:fetch = 0.59:1, avg = 708; 6970 ALU:fetch = 0.72, avg = 590. According to the APP profiler, this is about twice as fast as your original listing.
#define D 1000
#define ls 64
__kernel void element_mult(
__global float *result,
__global const float *vector,
__global const float *matrix,
__global const float *matrix2,
const float factor)
{
__local float vectCopy[D];
int lid = get_local_id(0);
for(int i=0;i<D;i+=ls){
vectCopy[i+lid] = vector[i+lid];
}
barrier(CLK_LOCAL_MEM_FENCE); // wait until vectCopy is fully populated
int ng = get_num_groups(0);
int gid = get_group_id(0);
int y, k;
__local float sum[ls];
for(y = gid; y < D; y+=ng){
sum[lid] = 0.0f; // reset this work-item's partial sum for the current y
for(k = lid; k < D; k+=ls)
{
sum[lid] += vectCopy[k] * matrix[(y*D) + k] * matrix2[(y*D) + k ];
}
barrier(CLK_LOCAL_MEM_FENCE); // all partial sums must be written before the reduction
if(lid==0){
result[y] = sum[0];
for(k=1;k<ls;k++){
result[y] += sum[k];
}
result[y] *= factor;
}
barrier(CLK_LOCAL_MEM_FENCE); // finish the reduction before sum[] is reused in the next iteration
}
}
EDIT: APP profiler = AMD APP KernelAnalyzer
