I'm just learning to work with ViennaCL. My first tries on the CPU worked fine; now I am trying to use OpenCL. However, I can't manage to get data onto the GPU: while the matrices seem to be created, they never receive any contents:
#define VIENNACL_WITH_OPENCL
#define VIENNACL_WITH_UBLAS
#include <cassert>
#include <boost/numeric/ublas/matrix.hpp>
#include "viennacl/matrix.hpp"

int main() {
    boost::numeric::ublas::matrix<float> data_cpu(1, 1);
    data_cpu(0, 0) = 1;
    viennacl::matrix<float> data_gpu(1, 1);
    viennacl::copy(data_cpu, data_gpu);  // host -> device
    assert(data_cpu(0, 0) == data_gpu(0, 0));
}
After this, data_gpu(0,0) is 0 but I believe it should be 1.
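A hedged way to double-check this on the host (this snippet is mine and continues main() above; it is not part of my original test) is to copy the whole matrix back into a second uBLAS matrix with viennacl::copy, which also works in the device-to-host direction:
// Continues main() above: copy the GPU matrix back and compare on the host.
boost::numeric::ublas::matrix<float> check_cpu(1, 1);
viennacl::copy(data_gpu, check_cpu); // device -> host
assert(check_cpu(0, 0) == data_cpu(0, 0));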
I'm compiling this with g++ nocopy.cpp -framework OpenCL. I am using OS X with the provided OpenCL driver.
What am I doing wrong here?
Edit: Removing VIENNACL_WITH_OPENCL fixes the problem, but that is not what I want.
It looks like (my?) OS X installation of OpenCL is somehow broken; other plain OpenCL examples fail as well:
noname:histogram Markus$ ./histogram
OpenCL Device Vendor = NVIDIA, OpenCL Device Name = GeForce GT 650M, OpenCL Device Version = OpenCL 1.1
Image Histogram for image type = CL_RGBA, CL_UNORM_INT8: verify_histogram_results failed for indx = 0, gpu result = 0, expected result = 8204
Image dimensions: 1920 x 1080 pixels, Image type = CL_RGBA, CL_UNORM_INT8
Time to compute histogram = 0 ms
Image Histogram for image type = CL_RGBA, CL_FLOAT: verify_histogram_results failed for indx = 0, gpu result = 0, expected result = 8049
Image dimensions: 1920 x 1080 pixels, Image type = CL_RGBA, CL_FLOAT
Time to compute histogram = 0 ms
noname:histogram Markus$ pwd
/Users/Markus/Desktop/tmp/opencl-book-samples-read-only/src/Chapter_14/histogram
Related
I am trying to implement a general matrix-matrix multiplication OpenCL kernel, one that conforms to C = α*A*B + β*C.
The Kernel
I did some research online and decided to use a modified kernel from this website as a starting point. The main modification I have made is that the local memory used as working space is now allocated dynamically. Below is the kernel I have written:
__kernel
void clkernel_gemm(const uint M, const uint N, const uint K, const float alpha,
                   __global const float* A, __global const float* B, const float beta,
                   __global float* C, __local float* Asub, __local float* Bsub) {

    const uint row = get_local_id(0);
    const uint col = get_local_id(1);
    const uint TS = get_local_size(0); // Tile size
    const uint globalRow = TS * get_group_id(0) + row; // Row ID of C (0..M)
    const uint globalCol = TS * get_group_id(1) + col; // Col ID of C (0..N)

    // Initialise the accumulation register
    float acc = 0.0f;

    // Loop over all tiles
    const int numtiles = K / TS;
    for (int t = 0; t < numtiles; t++) {
        const int tiledRow = TS * t + row;
        const int tiledCol = TS * t + col;
        Asub[col * TS + row] = A[tiledCol * M + globalRow];
        Bsub[col * TS + row] = B[globalCol * K + tiledRow];

        barrier(CLK_LOCAL_MEM_FENCE);

        for (int k = 0; k < TS; k++) {
            acc += Asub[k * TS + row] * Bsub[col * TS + k] * alpha;
        }

        barrier(CLK_LOCAL_MEM_FENCE);
    }

    C[globalCol * M + globalRow] = fma(beta, C[globalCol * M + globalRow], acc);
}
Tile Size (TS) is now a value defined in the calling code, which looks like this:
// A, B and C are 2D matrices, their cl::Buffers have already been set up
// and values appropriately set.
kernel.setArg(0, (cl_int)nrowA);
kernel.setArg(1, (cl_int)ncolB);
kernel.setArg(2, (cl_int)ncolA);
kernel.setArg(3, alpha);
kernel.setArg(4, A_buffer);
kernel.setArg(5, B_buffer);
kernel.setArg(6, beta);
kernel.setArg(7, C_buffer);
kernel.setArg(8, cl::Local(sizeof(float) * nrowA * ncolB));
kernel.setArg(9, cl::Local(sizeof(float) * nrowA * ncolB));
cl::NDRange global(nrowA, ncolB);
cl::NDRange local(nrowA, ncolB);
status = cmdq.enqueueNDRangeKernel(kernel, cl::NDRange(0), global, local);
The Problem
The problem I am encountering is that unit tests I have written (with Google's gtest) fail randomly, but only for this particular kernel. (I have 20 other kernels in the same .cl source file that pass their tests 100% of the time.)
I have a test that multiplies a 1x4 float matrix {0.0, 1.0, 2.0, 3.0} with a transposed version of itself {{0.0}, {1.0}, {2.0}, {3.0}}. The expected output is {14.0}.
However, I can get this correct result maybe just 75% of the time.
Sometimes, I can get 23.0 (GTX 970), 17.01 (GTX 750) or just -nan and 0.0 (all 3 devices). The curious part is, the respective incorrect results seem to be unique to the devices; I cannot seem to, for example, get 23.0 on the Intel CPU or the GTX 750.
I am baffled because if I have made an algorithmic or mathematical mistake, the mistake should be consistent; instead I am getting incorrect results only randomly.
What am I doing wrong here?
Things I have tried
I have verified that the data going into the kernels are correct.
I have tried to initialize both __local memory to 0.0, but this causes all results to become wrong (but frankly, I'm not really sure how to initialize it properly)
I have written a test program that only executes this kernel to rule out any race conditions interacting with the rest of my program, but the bug still happens.
Other points to note
I am using the C++ wrapper retrieved directly from the Github page.
To use the wrapper, I have defined CL_HPP_MINIMUM_OPENCL_VERSION 120 and CL_HPP_TARGET_OPENCL_VERSION 120.
I am compiling the kernels with the -cl-std=CL1.2 flag.
All cl::Buffers are created with only the CL_MEM_READ_WRITE flag.
I am testing this on Ubuntu 16.04, Ubuntu 14.04, and Debian 8.
I have tested this on Intel CPUs with the Intel OpenCL Runtime 16.1 for Ubuntu installed. The runtime reports that it supports up to OpenCL 1.2.
I have tested this on both Nvidia GTX 760 and 970. Nvidia only supports up to OpenCL 1.2.
All 3 platforms exhibit the same problem with varying frequency.
This looks like a complicated one. There are several things to address and they won't fit into comments, so I'll post all this as an answer even though it does not solve your problem (yet).
I am baffled because if I have made an algorithmic or mathematical mistake, the mistake should be consistent; instead I am getting incorrect results only randomly.
Such behavior is a typical indicator of a race condition.
I have tried to initialize both __local memory to 0.0, but this causes all results to become wrong (but frankly, I'm not really sure how to initialize it properly)
Actually this is a good thing. Finally we have some consistency.
Initializing local memory
Initializing local memory can be done using the work items, e.g. if you have a 1D workgroup of 16 items and your local memory consists of 16 floats, just do this:
local float* ptr = ... // your pointer to local memory
int idx = get_local_id(0); // get the index for the current work-item
ptr[idx] = 0.f; // init with value 0
barrier(CLK_LOCAL_MEM_FENCE); // synchronize local memory access within workgroup
If your local memory is larger, e.g. 64 floats, you will have to use a loop in which each work item initializes 4 values; that is at least the most efficient way, and it is sketched below. However, no one will stop you from having every work item initialize every value in the local memory, even though that is complete nonsense, since you would essentially be initializing it multiple times.
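A minimal, self-contained sketch of that strided initialization (the kernel name and the scratch_len parameter are mine, purely for illustration):
// Each work item zeroes a strided subset of a dynamically sized __local buffer,
// then the whole work group synchronizes before using it.
__kernel void init_local_example(__local float* scratch, const uint scratch_len)
{
    const uint lid = get_local_id(0);   // index of this work item within the group
    const uint lsz = get_local_size(0); // work-group size
    // With 16 work items and scratch_len == 64, each item initializes 4 values.
    for (uint i = lid; i < scratch_len; i += lsz)
        scratch[i] = 0.0f;
    barrier(CLK_LOCAL_MEM_FENCE); // everyone sees the zeroed buffer afterwards
}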
Your changes
The original algorithm looks like it is specifically designed to use square tiles.
__local float Asub[TS][TS];
__local float Bsub[TS][TS];
Not only that, but the size of the local memory matches the work-group size, which is 32x32 in their example.
When I look at your kernel parameters for local memory, I can see that you use parameters that are defined as M and N in the original algorithm. This doesn't seem correct; each local buffer only needs to hold a single TS x TS tile, as sketched below.
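To make that concrete, here is a hedged sketch of how I would expect the host-side arguments to look if each local buffer is sized for a single square TS x TS tile; TS = 16 is an arbitrary assumption and the variable names follow your host code:
const size_t TS = 16; // must equal the local work-group dimension used below
// Asub and Bsub each hold exactly one TS x TS tile, not an M x N matrix.
kernel.setArg(8, cl::Local(sizeof(float) * TS * TS));
kernel.setArg(9, cl::Local(sizeof(float) * TS * TS));
// The global size covers C and must be a multiple of TS in each dimension.
cl::NDRange global(nrowA, ncolB);
cl::NDRange local(TS, TS);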
Update 1
Since you have not described whether the original algorithm works for you, this is what you should do to find your error:
Create a set of test data. Make sure you only use data sizes that are actually supported by the original algorithm (e.g. the minimum size, multiples of x, etc.; see the sketch after this list). Also, use large data sets, since some errors only show up when multiple work groups are dispatched.
Use the original, unaltered algorithm with your testdata sets and verify the results.
Change the algorithm only so that dynamically sized local memory is used instead of fixed-size local memory, but make sure it has the same size as in the fixed-size approach. This is what you tried, but I think it failed due to what I have described under "Your changes".
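As a hedged illustration of the first step, a hypothetical helper like this (the name is mine) makes the size restriction of the original fixed-tile algorithm explicit:
#include <cstddef>

// The original kernel iterates K / TS tiles with no edge handling, so it only
// supports dimensions that are exact multiples of the tile size TS.
bool sizesSupported(std::size_t M, std::size_t N, std::size_t K, std::size_t TS)
{
    return (M % TS == 0) && (N % TS == 0) && (K % TS == 0);
}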
While writing a simple OpenCL kernel I tried to use semaphores, and it crashed my GPU drivers (AMD 12.10). After checking out examples I found out that the crash happens only when the local work size is not equal to 1.
This code is taken from the example:
#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable
#pragma OPENCL EXTENSION cl_khr_local_int32_base_atomics : enable
#pragma OPENCL EXTENSION cl_khr_global_int32_extended_atomics : enable
#pragma OPENCL EXTENSION cl_khr_local_int32_extended_atomics : enable

void GetSemaphor(__global int * semaphor)
{
    int occupied = atom_xchg(semaphor, 1);
    while(occupied > 0)
    {
        occupied = atom_xchg(semaphor, 1);
    }
}

void ReleaseSemaphor(__global int * semaphor)
{
    int prevVal = atom_xchg(semaphor, 0);
}

__kernel void kernelNoAtomInc(__global int * num,
                              __global int * semaphor)
{
    int i = get_global_id(0);
    GetSemaphor(&semaphor[0]);
    {
        num[0]++;
    }
    ReleaseSemaphor(&semaphor[0]);
}
In the example, the author uses
CQ.Execute(kernelNoAtomInc, null, new long[1] { N }, new long[1] { 1 }, null);
where N = global_work_size and local_work_size = 1.
Now, if I change the 1 to null, or to 2, 4, or any other number I tried, the AMD drivers crash:
CQ.Execute(kernelNoAtomInc, null, new long[1] { N }, new long[1] { 2 }, null);
I do not have another PC to test on at the moment. However, it seems strange that the author deliberately left local_group_size = 1, which is why I think I am missing something here. Can someone please explain this to me? Also, as far as I understand, leaving local_group_size at 1 will greatly affect performance, won't it?
Thanks.
Host: Win8 x64, HD6870
Your problem is not reproducible for me, and I furthermore cannot find your source from the link, but here are a few ideas on why it could crash, which should still be helpful (even 9 years after the fact).
It probably crashes because...
... the driver thinks you want the local version of that atom_xchg() function to be executed, when instead you want the global one.
... your loop slows down execution of that kernel so drastically on an old machine that an internal execution-time limit is exceeded, causing the driver to terminate the kernel.
What I can suggest for a possible fix:
Do not enable the local version of the atom function in your kernel (see the sketch after this list).
Try running it on the CPU.
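As a hedged sketch of the first suggestion (not a verified fix), the kernel would enable only the global atomics extensions, so there is no __local overload of atom_xchg() for the compiler to pick:
// Only the global variants are enabled; the cl_khr_local_* pragmas are dropped.
#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable
#pragma OPENCL EXTENSION cl_khr_global_int32_extended_atomics : enable

void GetSemaphor(__global int * semaphor)
{
    // Spin until the exchange returns 0, i.e. this work item acquired the lock.
    while (atom_xchg(semaphor, 1) > 0) { }
}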
Beyond that, there is no way to fix this for certain unless we could access your computer and debug it there.
You were also asking why the author chose a local_group_size of one. This is because the global work size needs to be divisible by the local work size, such that the division results in a natural number. Dividing a natural number by one always results in a natural number, therefore this is perfect for experimenting. You are completely correct in saying that it will affect performance greatly. (It may simply be that with other sizes the maths didn't add up, and the kernel didn't crash but never even started.)
Additional notes:
To make the increment functionally correct, you should use atom_inc() on your num buffer (see the sketch after these notes). I don't see how this could lead to a crash, but it definitely makes your program not work as intended.
I would go and use the atomic functions from the 2.0 standard, since they already feature semaphore-like functions: bool atomic_flag_test_and_set(volatile atomic_flag *object) and void atomic_flag_clear(volatile atomic_flag *object).
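As a hedged sketch of the first note (my code, not the original example), the whole semaphore dance collapses into a single atomic increment:
#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

__kernel void kernelAtomInc(__global int * num)
{
    atom_inc(&num[0]); // each work item increments num[0] exactly once, atomically
}

Under the 2.0 standard, atomic_flag_test_and_set() and atomic_flag_clear() would take over the acquire/release roles of GetSemaphor/ReleaseSemaphor in the same way.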
I am developing some kernels which work with image buffers. When I create my Image2D by directly copying the image data, everything works well.
However, if I instead try to enqueue a write to my image buffer, it doesn't work on my GPU.
Here is a basic kernel:
__kernel void myKernel(__read_only image2d_t in, __write_only image2d_t out) {
    const int x = get_global_id(0);
    const int y = get_global_id(1);
    const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
    uint4 pixel = read_imageui(in, sampler, (int2)(x, y));
    write_imageui(out, (int2)(x, y), pixel);
}
Well, that simple kernel gives me a black image on my GPU, but works well on my CPU.
To make it work, I have to release the image buffer and create a new one, passing the data directly using CL_MEM_COPY_HOST_PTR.
I use the correct data format (CL_RGBA, CL_UNSIGNED_INT8), and the size of my image is correct.
I have encountered the problem with both JOCL and the C++ binding of the API (I didn't test the C API).
In the end it runs if I recreate the buffer, but is that a good idea? Is this normal? What can I do to avoid it?
By the way, I'm running on the Intel SDK for OpenCL (Intel Core i7) and the ATI AMD APP SDK (HD6800).
[edit]
Here is the code I use to write to my buffers.
First, the allocation part:
cl_image_format imageFormat = new cl_image_format();
imageFormat.image_channel_order = CL_RGBA;
imageFormat.image_channel_data_type = CL_UNSIGNED_INT8;
inputImageMem = clCreateImage2D(
context, CL_MEM_READ_ONLY,
new cl_image_format[]{imageFormat}, imageSizeX, imageSizeY,
0, null, null);
And the part which doesn't work on the GPU, called for each frame at run time:
clEnqueueWriteImage(commandQueue, inputImageMem, CL_TRUE, new long[]{0, 0, 0},
new long[]{imageSizeX, imageSizeY, 1}, 0, 0,
Pointer.to(data), 0, null, null);
The part which works on both the GPU and the CPU, but forces me to recreate the buffer:
clReleaseMemObject(inputImageMem);
cl_image_format imageFormat = new cl_image_format();
imageFormat.image_channel_order = CL_RGBA;
imageFormat.image_channel_data_type = CL_UNSIGNED_INT8;
inputImageMem = clCreateImage2D(
    context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
    new cl_image_format[]{imageFormat}, imageSizeX, imageSizeY,
    0, Pointer.to(data), null);
The data sent is an int array of size imageSizeX*imageSizeY, which I get with this code:
DataBufferInt dataBuffer = (DataBufferInt)image.getRaster().getDataBuffer();
int data[] = dataBuffer.getData();
The code above is in Java using JOCL; the same problem appears in another C++ program using the C++ OpenCL wrapper. The only differences are that in Java the virtual machine crashes (after 3-4 frames) and in C++ the result is a black image.
Well, I found the problem: it was my drivers acting up.
I was using version 12.4 (the version I installed when I began working with OpenCL); I just installed version 12.6 and the problem disappeared.
So, keep your drivers up to date!
I'm using OpenCL with an AMD video card and have the latest driver in Linux.
When I do something like:
int a = get_group_id(0) > 0 ? vector[ get_group_id(0)-1 ].word[ id ] : 0;
I get a wrong result. But if I use a barrier(CLK_LOCAL_MEM_FENCE); after this line, I get the correct result.
Why is that happening?
PS: Using an NVIDIA video card, on both Linux and Windows, I get the right answer without using a barrier.
The block is (using __global *input and __global *output):
int a = get_group_id(0) > 0 ? vector[ get_group_id(0)-1 ].word[ id ] : 0;
int b = get_group_id(0) > 0 ? c + a : a;
output[b + id] = input[ d + id ]; //Last kernel line
I'm using a work-group size of 128. I've tried this on an HD 6790 under Linux.
Thanks
Seems similar to this bug that I reported earlier: http://devgurus.amd.com/thread/158479
So I'm afraid it's a bug in the compiler and there's not much you can do, other than use your local barrier and wait until AMD fixes their stuff.
(Note that, as suggested in the linked topic, a local mem_fence should actually be enough to prevent the compiler from making this error.)
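For illustration, here is a hedged, self-contained sketch of that lighter workaround; the buffer layout is hypothetical and only mirrors the shape of the snippet from the question:
__kernel void fence_workaround(__global const int* input,
                               __global int* output,
                               const int c,
                               const int d)
{
    const int id = get_local_id(0);
    // The conditional read that triggered the miscompilation in the question.
    int a = get_group_id(0) > 0 ? input[get_group_id(0) - 1] : 0;
    mem_fence(CLK_LOCAL_MEM_FENCE); // local fence instead of a full barrier
    int b = get_group_id(0) > 0 ? c + a : a;
    output[b + id] = input[d + id];
}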
I am having a problem like this. Basically, I have a 2D grid allocated on host:
double* grid = (double*)malloc(sizeof(double)*(ny*nx) * 9);
Following the normal OpenCL procedure, I put it on the OpenCL device:
cl_mem cl_grid = clCreateBuffer(context, CL_MEM_COPY_HOST_PTR, sizeof(double) * (ny*nx) * 9, grid, &error);
And then enqueue and launch the kernel:
clEnqueueNDRangeKernel(queue, foo, 1, NULL, &global_ws, &local_ws, 0, NULL, NULL);
In the kernel function, simple arithmetic is performed on the 1st column of the grid:
__kernel void foo(__constant ocl_param* params, __global double* grid)
{
    const int ii = get_global_id(0);
    int jj = 0;

    if (ii < params->ny) {
        grid[getIndexUsingMacro(ii, jj)] += params->someNumber;
    }
}
And finally read back the buffer and check values.
clEnqueueReadBuffer(queue, cl_grid, CL_TRUE, 0, sizeof(double) * 9 * nx * ny, checkGrid, 0, NULL, NULL);
The problem occurs when the grid size (nx * ny * 9 doubles, i.e. nx * ny * 9 * 8 bytes since double precision is used) exceeds 16384 * 9 * 8 bytes = 1152 KB.
If using OpenCL on the CPU, a CL_OUT_OF_RESOURCES error is returned when launching the kernel, no matter what I set for global_ws and local_ws (I set them both to 1 and the error is still returned). The CPU is an Intel i5-2415M with 8 GB of RAM and a 3 MB cache.
If using OpenCL on the GPU (an NVIDIA Tesla M2050), no error is thrown. However, when reading back the values from the buffer, the grid is not changed at all: it returns a grid whose values are exactly the same as before it was sent to the kernel function.
For example, when I set nx = 30 and ny = 546 (nx*ny = 16380), everything runs fine and the returned grid has its values changed as expected. But when ny = 547 (nx*ny = 16410), the problem occurs on both the CPU and the GPU as described above. The problem is the same if I swap nx and ny, so if nx = 547 and ny = 30, it still happens. Can you suggest what might be the problem here?
Many thanks
It looks like a synchronization issue. grid[index] += value with the same index value may be executed concurrently by several work items. This operation is not atomic, and all these work items will load grid[index], add their value, and store it back, possibly losing some increments in the process.
To solve this, you can synchronize these work items with a barrier if they are in a single work group, or by enqueuing additional kernels otherwise.
Another possibility is to ensure that only one work item is able to modify a given element of the grid (usually the best solution; a sketch follows below).
If several work items need to work on a common subset of the grid, using local memory and local memory barriers may be useful.
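As a hedged sketch of the "one work item per element" approach (the argument list and the row-major indexing with 9 slots per cell are assumptions, since ocl_param and getIndexUsingMacro are not shown in the question):
#pragma OPENCL EXTENSION cl_khr_fp64 : enable // doubles require this extension

__kernel void add_to_first_column(const uint ny,
                                  const double someNumber,
                                  __global double* grid)
{
    const uint ii = get_global_id(0);    // exactly one work item per grid row
    if (ii < ny) {
        const uint jj = 0;               // only the first of the 9 slots per cell
        grid[ii * 9 + jj] += someNumber; // no other work item touches this element
    }
}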