Is there a way to get detailled information about how an OpenCL kernel was compiled on NVidia platforms (or on other platforms). Either external tools or tests that can be put into the kernel. Specifically:
Did vectorization succeed, and how are did the work items get grouped into warps?
If work items inside a work group go into different branches, did the compiler optimize it so that they still execute in parallel?
Did private memory variables get mapped to registers in the multiprocessor, or were they put into local/global memory? (Some architectures have more private memory per work group than local memory)
Can this information be seen in the PTX assembly output, or is this still higher level?
This is all compiler-level metadata; some of those are available through generic OpenCL API but the ones you request are way too low-level. Might be available through some Nvidia OpenCL extension though, i'm not familiar with those. Probably your best bet is finding some tools working on PTX level and feeding it the OpenCL program binaries.
You can always just generate PTX assembly and look into that:
program.build("-cl-fast-relaxed-math");
cout << program.getInfo<CL_PROGRAM_BINARIES>()[0] << endl;
In PTX you see exactly how the compiler translated the OpenCL code. Find the PTX documentation here.
Related
I'm working on some OpenCL code within a larger project. The code only gets compiled at run-time - but I don't want to deploy a version and start it up just for that. Is there some way for me to have the syntax of those kernels checked (even without consider), or even compile them, at least under some restrictions, to make it easier to catch errors earlier?
I will be targeting AMD and/or NVIDIA GPUs.
The type of program you are looking for is an "offline compiler" for OpenCL kernels - knowing this will hopefully help with your search. They exist for many OpenCL implementations, you should check availability for the specific implementation you are using; otherwise, a quick web search suggests there are some generic open source ones which may or may not fit the bill for you.
If your build machine is also your deployment machine (i.e. your target OpenCL implementation is available on your build machine), you can of course also put together a very basic offline compiler yourself by simply wrapping clBuildProgram() and friends in a basic command line utility.
Writing simple OpenCL kernels evolves repeating the following steps:
1. Put the kernel code in a string
2. call clCreateProgramWithSource
3. call clBuildProgram
4. call clCreateKernel
5. call clSetKernelArg (x number of arguments)
6. call clEnqueueNDRangeKernel
Is there a utility library that can make this process less painful, even in the cost of reduced flexibility? I am looking for something similar to GLUT / OpenGL for writing OpenCL programs
Check out Intel(R) SDK for OpenCL Applications https://software.intel.com/en-us/intel-opencl - it has tools to simplify OpenCL development quite a bit.
This is a very strange situation. Why do I get error
CL_PLATFORM_NOT_FOUND_KHR
when I'm calling this function:
clGetPlatformIDs(0, NULL, &platformCount);
Earlier this error was not. I have installed the driver and SDK from Intel and Nvidia. Are there any suggestions?
Here is explained why such error can occur. clGetPlatformIDs returns CL_SUCCESS if the function is executed successfully and there are a non-zero number of platforms available. Otherwise it can return CL_PLATFORM_NOT_FOUND_KHR if the cl_khr_icd extension is enabled and no platforms are found.
You are in luck. Well sort of... Seeing this is 3 years later.
Disclaimer: I HAVE NO CLUE WHY THIS WORKS:
Machine: x64 windows 10.
Graphics Card: Geforce GTX 960
Total Failure To Load Library : LoadLibraryA( "OpenCL64.dll" )
WRONG (but loads) : LoadLibraryA( "C:/Program Files/NVIDIA Corporation/OpenCL/OpenCL64.dll" )
WRONG (but loads) : LoadLibraryA( "C:/Program Files/NVIDIA Corporation/OpenCL/OpenCL.dll" )
CORRECT: LoadLibraryA( "OpenCL.dll" )
Here is the really insideous thing: Both of my "WRONG" answers will let you
grab function pointers, but when you call clGetPlatformIDs the return status
will be 0xFFFFFC17 ( CL_PLATFORM_NOT_FOUND_KHR ).
Then you'll be examining your function call correctness.
Maybe you'll even look at the calling convention. Maybe you'll check
the header files and make sure there are not any typos there. And yet,
you are looking in all the wrong places because the original problem happened
more steps back than you think.
Because of this problem, I build into my programs code that reads a file:
"OPEN_CL_SEARCH_PATHS.TXT" so the user of the software can change what DLL file
the program attempts to load.
While I am here, I would also like to add that there seems to be a bug with the
driver that makes it so OpenCL <==> OpenGL sharing is NOT a zero-copy share and
is incredibly laggy. Now I've got to go figure out Vulkan to make my fractal
rendering engine even though OpenCL's abstraction better suits the problem.
It is probably important to note that I am NOT using an SDK or any
validation layers. In fact, I am not even using
windows.h.
I wrote assembly code to grab the address of GetProcAddress and LoadLibrary by navigating the PEB file. I am also not using cl.h or cl_platform.h.
I reconstruct the structs I need from the documentation. I am also not
bothering with prototypes for function signatures either. For example,
I call "clGetPlatformIDs" by casting it to type "F_03" and then
calling it that way.
typedef void* (F_03)( void, void*, void* );
My machine doesn't have GPU and so had to use hashcat with OpenCL for CPU alone. My machine was Intel core i3, so I have downloaded the OpenCL softwares from Intel website and installed manually and the error gone.
Source: https://youtu.be/AieYqNQ6ADM
I have an OpenCL application that runs fine when using an AMD GPU.
When using an NVidia card, the clBuildProgram call crashes the application (does not even return a failure value, just a crash). When debugging, the crash yields:
read access violation in the nvopencl.dll module. code 0xc0000005. The debugger indicates the clGetExportTable function (inside nvopencl.dll) as source of the violation.
By commenting random parts of the kernels, I have reached this point:
In the code fragment:
if (something){
//some stuff
float3 gradient = (float3)(0,1,0);
gradient = normalize(gradient);
return;
}
By deleting the "gradient = normalize(gradient);" line, the clBuildProgram does not crash, but letting it there, crashed the program. the gradient variable is not even used inside the kernel, so it is not related to any other part of it. And the normalize funcion by itself should not be the source of the problem, because it is used in other parts of the code.
I think it may be related to some driver bug. Because installing the latest CUDA version (6.5) makes the OpenCL Volume Rendering sample binaries distributed by NVidia to misbehave, while using a CUDA 6 installation make the Volume Rendering sample to work properly.
My code is related to volume rendering techniques, that is why I think that it may be related, but my problem appears with both CUDA 6.5 and CUDA 6 installations.
Have you experienced something similar? What could be the cause of the problem, and how can I handle it?
Thank you.
After further analysis, the problem seems to be a bug in the drivers, as Xapa mentioned.
I have C source with MPI calls.
I wonder can I get sequential program from the source by linking with some MPI stub library? Where can I get this lib?
Most correctly-written MPI programs should not depend on the number of processes they use to get a correct answer -- eg, if you run them on one process (mpirun -np 1 ./a.out) they should still work. So you oughtn't need a stub library - just use MPI. (If for some reason you just don't want extraneous libraries kicking around, it's certainly possible to write stubs and link against them -- I did this back in the day when setting up MPI on my laptop was a huge PITA, you could use this as a starting point and add any functionality you need. But these days, fiddling with the stub library is probably going to be more work than just using an existing MPI implementation.)
If your MPI program does not currently work correctly on one processor, a stub library probably won't help; you'll need to find the special cases that it's not handling and fix them.
I don't think this is possible. Contrary to OpenMP, programs using MPI don't necessarily run or produce the same result when you simply take away the MPI part.
PETSc contains a stub MPI library that works for one process (ie serial) execution:
http://www.mcs.anl.gov/petsc/petsc-3.2/include/mpiuni/mpi.h