Bad results with GPU program [closed] - opencl

I am not getting good results from an iterative equation solver.
I am using a 2D array with "size_y" rows of "size_x" elements each.
The problem is that the code only does one iteration, because the cumulative error is equal to zero. This cumulative error is computed in the kernel code for each cell of the array.
Here are two parts of the source files of this solver:
Kernel code:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#define min(a,b) a <= b ? a : b
// kernel code
const char *source =
"__kernel void line_compute(__global double diagx, __global double diagy,\
__global double weightx, __global double weighty, __global int size_x,\
__global double* tab_process, __global double* tab_new, __global double* r) {\
const unsigned int iy = get_global_id(0);\
const unsigned int ix = get_global_id(1);\
/* do computation */\
tab_process[iy*size_x+ix] = weighty *( tab_new[(iy-1)*size_x+ix] +\
tab_new[(i+1)*size_x+ix] + tab_new[iy*size_x+ix]*diagy)+\
weightx *( tab_new[iy*size_x+(ix-1)] + tab_new[iy*size_x+(ix+1)] + tab_new[iy*size_x+ix]*diagx) ; \
r[iy*size_x+ix] = 0;\
rk = tab_new[iy*size_x+ix] - tab_process[iy*size_x+ix];\
r[iy*size_x+ix] =r[iy*size_x+ix]+ rk * rk;\
tab_new[iy*size_x+ix] = tab_process[iy*size_x+ix]\
}";
At execution time, the cumulative error, which I print with:
result = 0.0;
for (i = 1; i <= size_x*size_y; i++)
{
    result = result + r[i];
    printf("r[%d]=%20.18f\n", i, r[i]);
}
printf("result=%f\n",result);
*error=result;
is equal to zero. That's why the code only does one iteration.
I don't understand where the problem is.
Can anyone see what's wrong?

Please, when posting questions to Stack Overflow, isolate only the relevant sections of the code (and format it properly). This one is too much for anyone to look at.
Besides prematurely updating tab_new at the end of the kernel (you should do that only once, after all work items have finished, since neighbouring values depend on each other), you have a syntax error in the kernel source:
tab_process[iy*size_x+ix] = weighty *( tab_new[(iy-1)*size_x+ix] +\
>>> tab_new[(i+1)*size_x+ix] <<< + tab_new[iy*size_x+ix]*diagy)+\
weightx *( tab_new[iy*size_x+(ix-1)] + tab_new[iy*size_x+(ix+1)] + tab_new[iy*size_x+ix]*diagx) ; \
You have mistakenly written i instead of iy, so the program would most likely fail to build (the error surfaces when the program is compiled with clBuildProgram, not when it is created with clCreateProgramWithSource). Because you don't check the return code in ret, you miss that fact, and the following clCreateKernel and clEnqueueNDRangeKernel calls fail as well. With no kernel being executed, r_mem_obj keeps its initial value - all zeros, because it is a copy of r, which, as freshly allocated heap memory, is also all zeros (on Linux, newly committed pages after a read fault are CoW-mapped to a special all-zeros page in the kernel). Summing up all zeros gives zero.
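As a rough illustration of the error checking that would have caught this, here is a minimal sketch (hedged: identifiers such as context, device_id and ret are assumed from typical OpenCL host boilerplate, not taken from the question):
/* Sketch: build the program and dump the compiler log on failure. */
cl_int ret;
cl_program program = clCreateProgramWithSource(context, 1, &source, NULL, &ret);
if (ret != CL_SUCCESS) { fprintf(stderr, "create program failed: %d\n", ret); exit(1); }
ret = clBuildProgram(program, 1, &device_id, NULL, NULL, NULL);
if (ret != CL_SUCCESS) {
    size_t log_size;
    clGetProgramBuildInfo(program, device_id, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);
    char *log = malloc(log_size);
    clGetProgramBuildInfo(program, device_id, CL_PROGRAM_BUILD_LOG, log_size, log, NULL);
    fprintf(stderr, "build failed:\n%s\n", log);   /* the i/iy typo would show up here */
    free(log);
    exit(1);
}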

Related

Random NaN and incorrect results with OpenCL kernel

I am trying to implement a general matrix-matrix multiplication OpenCL kernel, one that conforms to C = α*A*B + β*C.
The Kernel
I did some research online and decided to use a modified kernel from this website as a starting point. The main modification I have made is that allocation of local memory as working space is now dynamic. Below is the kernel I have written:
__kernel
void clkernel_gemm(const uint M, const uint N, const uint K, const float alpha,
                   __global const float* A, __global const float* B, const float beta,
                   __global float* C, __local float* Asub, __local float* Bsub) {

  const uint row = get_local_id(0);
  const uint col = get_local_id(1);
  const uint TS = get_local_size(0); // Tile size
  const uint globalRow = TS * get_group_id(0) + row; // Row ID of C (0..M)
  const uint globalCol = TS * get_group_id(1) + col; // Col ID of C (0..N)

  // Initialise the accumulation register
  float acc = 0.0f;

  // Loop over all tiles
  const int numtiles = K / TS;
  for (int t = 0; t < numtiles; t++) {
    const int tiledRow = TS * t + row;
    const int tiledCol = TS * t + col;
    Asub[col * TS + row] = A[tiledCol * M + globalRow];
    Bsub[col * TS + row] = B[globalCol * K + tiledRow];

    barrier(CLK_LOCAL_MEM_FENCE);

    for (int k = 0; k < TS; k++) {
      acc += Asub[k * TS + row] * Bsub[col * TS + k] * alpha;
    }

    barrier(CLK_LOCAL_MEM_FENCE);
  }

  C[globalCol * M + globalRow] = fma(beta, C[globalCol * M + globalRow], acc);
}
Tile Size (TS) is now a value defined in the calling code, which looks like this:
// A, B and C are 2D matrices, their cl::Buffers have already been set up
// and values appropriately set.
kernel.setArg(0, (cl_int)nrowA);
kernel.setArg(1, (cl_int)ncolB);
kernel.setArg(2, (cl_int)ncolA);
kernel.setArg(3, alpha);
kernel.setArg(4, A_buffer);
kernel.setArg(5, B_buffer);
kernel.setArg(6, beta);
kernel.setArg(7, C_buffer);
kernel.setArg(8, cl::Local(sizeof(float) * nrowA * ncolB));
kernel.setArg(9, cl::Local(sizeof(float) * nrowA * ncolB));
cl::NDRange global(nrowA, ncolB);
cl::NDRange local(nrowA, ncolB);
status = cmdq.enqueueNDRangeKernel(kernel, cl::NDRange(0), global, local);
The Problem
The problem I am encountering is that the unit tests I have written (with Google's gtest) will randomly fail, but only for this particular kernel. (I have 20 other kernels in the same .cl source file that pass their tests 100% of the time.)
I have a test that multiplies a 1x4 float matrix {0.0, 1.0, 2.0, 3.0} with a transposed version of itself {{0.0}, {1.0}, {2.0}, {3.0}}. The expected output is {14.0}.
However, I can get this correct result maybe just 75% of the time.
Sometimes, I can get 23.0 (GTX 970), 17.01 (GTX 750) or just -nan and 0.0 (all 3 devices). The curious part is, the respective incorrect results seem to be unique to the devices; I cannot seem to, for example, get 23.0 on the Intel CPU or the GTX 750.
I am baffled because if I have made an algorithmic or mathematical mistake, the mistake should be consistent; instead I am getting incorrect results only randomly.
What am I doing wrong here?
Things I have tried
I have verified that the data going into the kernels are correct.
I have tried to initialize both __local memory buffers to 0.0, but this causes all results to become wrong (frankly, I'm not really sure how to initialize them properly).
I have written a test program that only executes this kernel to rule out any race conditions interacting with the rest of my program, but the bug still happens.
Other points to note
I am using the C++ wrapper retrieved directly from the Github page.
To use the wrapper, I have defined CL_HPP_MINIMUM_OPENCL_VERSION 120 and CL_HPP_TARGET_OPENCL_VERSION 120.
I am compiling the kernels with the -cl-std=CL1.2 flag.
All cl::Buffers are created with only the CL_MEM_READ_WRITE flag.
I am testing this on Ubuntu 16.04, Ubuntu 14.04, and Debian 8.
I have tested this on Intel CPUs with the Intel OpenCL Runtime 16.1 for Ubuntu installed. The runtime reports that it supports up to OpenCL 1.2
I have tested this on both Nvidia GTX 760 and 970. Nvidia only supports up to OpenCL 1.2.
All 3 platforms exhibit the same problem with varying frequency.
This looks like a complicated one. There are several things to address and they won't fit into comments, so I'll post all this as an answer even though it does not solve your problem (yet).
I am baffled because if I have made an algorithmic or mathematical mistake, the mistake should be consistent; instead I am getting incorrect results only randomly.
Such a behavior is a typical indicator of race conditions.
I have tried to initialize both __local memory buffers to 0.0, but this causes all results to become wrong (frankly, I'm not really sure how to initialize them properly).
Actually this is a good thing. Finally we have some consistency.
Initializing local memory
Initializing local memory can be done using the work items, e.g. if you have a 1D workgroup of 16 items and your local memory consists of 16 floats, just do this:
local float* ptr = ... // your pointer to local memory
int idx = get_local_id(0); // get the index for the current work-item
ptr[idx] = 0.f; // init with value 0
barrier(CLK_LOCAL_MEM_FENCE); // synchronize local memory access within workgroup
If your local memory is larger, e.g. 64 floats, you will have to use a loop in which each work item initializes 4 values; that is the most efficient way. However, nothing stops you from having every work item initialize every value in local memory, even though that is pointless since you would essentially be initializing it multiple times. A strided loop, sketched below, also covers the case where the buffer length is not an exact multiple of the work-group size.
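A minimal sketch of that loop (names like ptr and N_LOCAL_FLOATS are placeholders, not from this answer):
local float* ptr = ...                        // your pointer to local memory
const int lid = get_local_id(0);              // index of the current work-item
const int lsz = get_local_size(0);            // number of work-items in the group
for (int i = lid; i < N_LOCAL_FLOATS; i += lsz)
    ptr[i] = 0.f;                             // each item zeroes every lsz-th element
barrier(CLK_LOCAL_MEM_FENCE);                 // synchronize local memory access within workgroup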
Your changes
The original algorithm looks like it is specifically designed to use square tiles.
__local float Asub[TS][TS];
__local float Bsub[TS][TS];
Not only that, but the size of the local memory matches the work-group size, 32x32 in their example.
When I look at your kernel parameters for local memory, I can see that you use parameters that are defined as M and N in the original algorithm. This doesn't seem correct: each __local tile only needs to hold TS*TS floats (see the sketch below).
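Roughly what the local-memory arguments would look like when sized to the tile rather than to the whole matrices (shown with the plain C API for illustration; in the question's C++ wrapper the equivalent would be cl::Local(sizeof(float) * TS * TS), with TS assumed to equal the local work size):
/* Hypothetical correction: each __local tile holds one TS x TS block. */
size_t tile_bytes = sizeof(cl_float) * TS * TS;
clSetKernelArg(kernel, 8, tile_bytes, NULL);   /* Asub */
clSetKernelArg(kernel, 9, tile_bytes, NULL);   /* Bsub */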
Update 1
Since you have not said whether the original algorithm works for you, this is what you should do to find your error:
1. Create a set of test data. Make sure you only use data sizes that are actually supported by the original algorithm (e.g. minimum size, multiples of x, etc.). Also use large data sets, since some errors only show up when multiple workgroups are dispatched.
2. Run the original, unaltered algorithm on your test data sets and verify the results.
3. Change the algorithm only so that dynamically sized local memory is used instead of fixed-size local memory, but make sure it has the same size as in the fixed-size approach. This is what you tried, but I think it failed for the reason described under "Your changes".

Graceful Underflow

I have been searching about this for a long time, but I am not able to understand what this question means.
Question:
Write a program in any language to determine how your computer handles graceful
underflow.
I understand that an overflow condition is something like this:
if an integer can store a maximum value of x and we assign a value of x+1, the value x+1 will be converted to the lowest value the integer can hold. I understand that underflow is just the reverse.
How does this stand from a high-performance scientific computing / linear algebra point of view?
I have read this link, but I think it's the same underflow/overflow stuff that I mentioned above. What does "graceful underflow" stand for?
Okay, the link posted by @StoneBird was particularly helpful. Here I have created a program in C that demonstrates the same.
#include <stdio.h>
#include <math.h>

int main(int argc, char **argv)
{
    unsigned int s, e, m;
    unsigned int *ptr;
    float a, temp = 0;
    a = 1;
    float min = pow(2, -129);
    while (a > min) {
        temp = a;
        a = a / 2;
    }
    printf("Value=%e\n", temp);
    ptr = (unsigned int *)&temp;
    s = *ptr >> 31;
    e = *ptr & 0x7f800000;
    e >>= 23;
    m = *ptr & 0x07fffff;
    printf("sign = %x\n", s);
    printf("exponent = %x\n", e);
    printf("mantissa = %x\n", m);
    return 0;
}
Here the min variable is used to change the final number... I used min=pow(2,-129), pow(2,-128) and pow(2,-130) to see the results, and saw the denormal numbers appear. This wiki page explains it all.
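As an extra illustration (not part of the original answer), the small C program below shows what makes the underflow "graceful": once values drop below FLT_MIN they degrade gradually through denormals instead of snapping straight to zero. It assumes a C11 compiler for FLT_TRUE_MIN.
#include <float.h>
#include <stdio.h>

int main(void)
{
    float f = FLT_MIN;                  /* smallest normalized float, 2^-126 */
    for (int i = 0; i < 5; i++) {
        printf("FLT_MIN / 2^%d = %e\n", i, f);
        f /= 2;                         /* each halving moves further into the denormal range */
    }
    printf("smallest denormal (FLT_TRUE_MIN) = %e\n", FLT_TRUE_MIN);   /* 2^-149 */
    return 0;
}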

OpenCL matrix vector multiplication code gives correct and incorrect solutions from run to run

I am working on OpenCL code for sparse matrix operations, and I find that it works when the code, including the kernel, is executed once or twice. But every few runs the answer is slightly off. Here is the very simple kernel I am using:
__kernel void dsmv( int N, __global int * IA,
                    __global int * JA, __global float * A,
                    __global float * X, __global float * Y){
    int IBGN, ICOL, IEND, ii;
    ICOL = get_global_id(0);
    if (ICOL < N)
    {
        IBGN = JA[ICOL]-1;
        IEND = JA[ICOL+1]-1-1;
        for (ii = IBGN; ii <= IEND; ii++)
        {
            Y[IA[ii]-1] += A[ii]*X[ICOL];
        }
    }
}
I can also post the fortran code that uses this kernel. I am using FortranCL.
What could cause the multiplication to give different answers from run to run?
This line looks suspicious:
Y[IA[ii]-1] += A[ii]*X[ICOL];
It seems that two work items may increment the same memory location, so there is a potential race condition here, and since += is not an atomic operation this is a problem.
Unfortunately you can't use the built-in atomic_add instead, because it doesn't support floats, but you can emulate one with atomic_cmpxchg on the float's bit pattern - or just look at this existing implementation of an atomic add for floats.
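A minimal sketch of that compare-and-swap approach (not the linked implementation verbatim; it assumes 32-bit floats and reinterprets their bits through a union):
// Sketch: retry atomic_cmpxchg on the 32-bit pattern of the float until
// no other work item has modified the location in between.
inline void atomic_add_float(volatile __global float *addr, float val)
{
    union { unsigned int u; float f; } old_val, new_val;
    do {
        old_val.f = *addr;
        new_val.f = old_val.f + val;
    } while (atomic_cmpxchg((volatile __global unsigned int *)addr,
                            old_val.u, new_val.u) != old_val.u);
}

// The racy update in the kernel would then become something like:
//     atomic_add_float(&Y[IA[ii]-1], A[ii]*X[ICOL]);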

Multi-gpu allocation through another function [closed]

Using CUDA, I want to allocate memory for different arrays, one for each GPU, from a function other than main(), but I must have missed something with regard to the pointer arithmetic. Here's what I thought:
void InitThisMemory(int***, int N, int Nout, size_t* pitch, int height, int width); // This function's purpose is to initialize A and the pitch

int main(void){
    int** A;
    int N = 10;
    int NOut = 2;
    int height = 2, width = 2;
    size_t pitch;
    InitThisMemory(&A, N, NOut, &pitch, height, width);
    return 0;
}

void InitThisMemory(int ***A, int N, int Nout, size_t* pitch, int height, int width){
    int i;
    *A = (int**)malloc(Nout * sizeof(int*));
    for(i = 0; i < Nout; i++){
        cudaSetDevice(i);
        cudaMallocPitch((void**)&(*A[i]), &(*pitch), width, height);
    }
}
Disclaimer: Not my actual code but this should reproduce the error. Let me know if I missed an allocation of a variable somewhere.
Why do I think that the problem is in the arithmetic? Simply because this works pretty well if Nout = 1 (which means that I am using only one device).
Any ideas?
Your bug, I think, is writing (void**)&(*A[i]) instead of (void **) (&(*A)[i]), but I recommend you refactor as follows:
use a local int ** variable to hold the malloc() return value;
use that local in your call to cudaMallocPitch();
pass back the malloc() return value only if all cudaMallocPitch() calls succeed.
If you do these things, then it will be simpler to write correct cleanup code in the event that one of the cudaMallocPitch() calls fails, and you needn't propagate the passback unless everything has succeeded.
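A rough sketch of that refactoring (hedged: the cudaError_t return type, the cleanup policy, and dropping the unused N parameter are my assumptions, not the answerer's code):
// Sketch: allocate into a local pointer array first; pass it back only on full success.
cudaError_t InitThisMemory(int ***A, int Nout, size_t *pitch, int height, int width)
{
    int **local = (int**)malloc(Nout * sizeof(int*));
    if (local == NULL) return cudaErrorMemoryAllocation;

    for (int i = 0; i < Nout; i++) {
        cudaSetDevice(i);
        cudaError_t err = cudaMallocPitch((void**)&local[i], pitch, width, height);
        if (err != cudaSuccess) {
            // Free everything allocated so far, then report the failure.
            for (int j = 0; j < i; j++) {
                cudaSetDevice(j);
                cudaFree(local[j]);
            }
            free(local);
            return err;
        }
    }
    *A = local;   // the caller sees the array only if every allocation succeeded
    return cudaSuccess;
}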

forcing stack w/i 32bit when -m64 -mcmodel=small

I have C sources that must compile in 32-bit and 64-bit for multiple platforms.
A structure takes the address of a buffer - I need to fit that address in a 32-bit value.
Obviously, where possible these structures will use naturally sized void * or char * pointers.
However, for some parts an API specifies the size of these pointers as 32-bit.
On x86_64 Linux with -m64 -mcmodel=small, both static data and malloc()'d data fit within the 2 GB range. Data on the stack, however, still starts in high memory.
So, given a small utility _to_32() such as:
int _to_32( long l ) {
    int i = l & 0xffffffff;
    assert( i == l );
    return i;
}
then:
char *cp = malloc( 100 );
int a = _to_32( (long)cp );
will work reliably, as would:
static char buff[ 100 ];
int a = _to_32( (long)buff );
but:
char buff[ 100 ];
int a = _to_32( (long)buff );
will fail the assert().
Does anyone have a solution for this without writing custom linker scripts?
Or any ideas on how to arrange the linker section for stack data? It would appear it is being put in this section in the linker script:
.lbss :
{
    *(.dynlbss)
    *(.lbss .lbss.* .gnu.linkonce.lb.*)
    *(LARGE_COMMON)
}
thanks!
The stack location is most likely specified by the operating system and has nothing to do with the linker.
I can't imagine why you are trying to force a pointer on a 64-bit machine into 32 bits. The memory layout of structures mainly matters when you are sharing the data with something that may run on another architecture, saving it to a file, or sending it across a network, but there are almost no valid reasons to send a pointer from one computer to another. Debugging is the only valid reason that comes to mind.
Even storing a pointer to be used later by another run of your program on the same machine would almost certainly be wrong, since where your program is loaded can differ. Making any use of such a pointer would be undefined and unpredictable.
The short answer appears to be that there is no easy answer - at least no easy way to reassign the range/location of the stack.
The loader 'ld-linux.so' obtains the load address at a very early stage of process activation - in the glibc sources, under elf/ and sysdeps/x86_64/, search out elf_machine_load_address() and elf_machine_runtime_setup().
This happens in the preamble before your _start() entry and the related setup that calls your main(). It is not for the faint-hearted; even I couldn't convince myself this was a safe route.
As it happens, the resolution presents itself in some other old-school tricks... pointer deflation/inflation...
With -mcmodel=small, automatic variables, alloca() addresses, and things like argv[] and envp are assigned from high memory, from where the stack grows down. Those addresses are verified in this example code:
#include <stdlib.h>
#include <stdio.h>
#include <alloca.h>

extern char etext, edata, end;

char global_buffer[128];

int main( int argc, const char *argv[], const char *envp[] )
{
    char stack_buffer[128];
    static char static_buffer[128];
    char *cp = malloc( 128 );
    char *ap = alloca( 128 );
    char *xp = "STRING CONSTANT";

    printf("argv[0] %p\n",argv[0]);
    printf("envp %p\n",envp);
    printf("stack %p\n",stack_buffer);
    printf("global %p\n",global_buffer);
    printf("static %p\n",static_buffer);
    printf("malloc %p\n",cp);
    printf("alloca %p\n",ap);
    printf("const %p\n",xp);
    printf("printf %p\n",printf);
    printf("First address past:\n");
    printf(" program text (etext) %p\n", &etext);
    printf(" initialized data (edata) %p\n", &edata);
    printf(" uninitialized data (end) %p\n", &end);
}
produces this output:
argv[0] 0x7fff1e5e7d99
envp 0x7fff1e5e6c18
stack 0x7fff1e5e6a80
global 0x6010e0
static 0x601060
malloc 0x602010
alloca 0x7fff1e5e69d0
const 0x400850
printf 0x4004b0
First address past:
program text (etext) 0x400846
initialized data (edata) 0x601030
uninitialized data (end) 0x601160
All access to/from the 32-bit parts of structures must be wrapped with inflate() and deflate() routines, e.g.:
void *inflate( unsigned long );
unsigned int deflate( void *);
deflate() tests for bits set in the range 0x7fff00000000 and marks the pointer so that inflate() will recognize how to reconstitute the actual pointer.
Hope that helps if anyone similarly must support structures with 32-bit storage for 64-bit pointers.
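For concreteness, here is a minimal sketch of what such a pair might look like. The tagging scheme is my own assumption, not the poster's actual code: with -mcmodel=small, static and heap pointers stay below 2 GB, so the top bit of the 32-bit value is free to mark a stack pointer whose fixed high bits 0x7fff00000000 have been stripped (this also assumes the low half of every stack address stays below 2 GB):
#include <assert.h>
#include <stdint.h>

#define STACK_HI   0x7fff00000000UL   /* high bits shared by the stack addresses above */
#define STACK_MASK 0xffff00000000UL
#define TAG        0x80000000U        /* marks a deflated stack pointer */

/* Compress a pointer into 32 bits. Static/heap pointers pass through;
 * stack-range pointers lose their fixed high bits and gain the tag bit. */
unsigned int deflate( void *p ) {
    uintptr_t v = (uintptr_t)p;
    if ( (v & STACK_MASK) == STACK_HI ) {
        assert( (v & TAG) == 0 );            /* low half must fit in 31 bits */
        return (unsigned int)v | TAG;
    }
    assert( v < 0x80000000UL );              /* small model: already 32-bit  */
    return (unsigned int)v;
}

/* Reconstitute the original pointer from its 32-bit form. */
void *inflate( unsigned long l ) {
    unsigned int u = (unsigned int)l;
    if ( u & TAG )
        return (void *)(STACK_HI | (uintptr_t)(u & ~TAG));
    return (void *)(uintptr_t)u;
}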
