Is there a utility toolkit for OpenCL? - opencl

Writing simple OpenCL kernels involves repeating the following steps:
1. Put the kernel code in a string
2. Call clCreateProgramWithSource
3. Call clBuildProgram
4. Call clCreateKernel
5. Call clSetKernelArg (once per argument)
6. Call clEnqueueNDRangeKernel
Is there a utility library that can make this process less painful, even at the cost of reduced flexibility? I am looking for something similar to what GLUT is for OpenGL, but for writing OpenCL programs.
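For reference, here is a minimal sketch of the six boilerplate steps above in plain C. The kernel name `scale` and the helper `run_scale` are made up for illustration, and the context, queue, device, and buffer are assumed to already exist; error handling is kept deliberately bare.

```c
#include <CL/cl.h>

/* Step 1: the kernel code lives in a string. */
static const char *src =
    "__kernel void scale(__global float *v, float f) {"
    "    size_t i = get_global_id(0);"
    "    v[i] *= f;"
    "}";

cl_int run_scale(cl_context ctx, cl_command_queue queue,
                 cl_device_id device, cl_mem buf, float factor, size_t n)
{
    cl_int err;

    /* Step 2: turn the source string into a program object. */
    cl_program program = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    if (err != CL_SUCCESS) return err;

    /* Step 3: compile the program for the chosen device. */
    err = clBuildProgram(program, 1, &device, "", NULL, NULL);
    if (err != CL_SUCCESS) return err;

    /* Step 4: look up the kernel by name. */
    cl_kernel kernel = clCreateKernel(program, "scale", &err);
    if (err != CL_SUCCESS) return err;

    /* Step 5: one clSetKernelArg call per kernel argument. */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    clSetKernelArg(kernel, 1, sizeof(float), &factor);

    /* Step 6: enqueue the kernel over n work items. */
    err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL,
                                 0, NULL, NULL);

    clReleaseKernel(kernel);
    clReleaseProgram(program);
    return err;
}
```

Every one of these calls is pure ceremony, which is exactly what wrapper libraries try to hide.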

Check out Intel(R) SDK for OpenCL Applications https://software.intel.com/en-us/intel-opencl - it has tools to simplify OpenCL development quite a bit.

Related

Can I check OpenCL kernel syntax at compilation time?

I'm working on some OpenCL code within a larger project. The code only gets compiled at run-time - but I don't want to deploy a version and start it up just for that. Is there some way for me to have the syntax of those kernels checked, or even to compile them, at least under some restrictions, to make it easier to catch errors earlier?
I will be targeting AMD and/or NVIDIA GPUs.
The type of program you are looking for is an "offline compiler" for OpenCL kernels - knowing this will hopefully help with your search. They exist for many OpenCL implementations, so you should check availability for the specific implementation you are using; otherwise, a quick web search suggests there are some generic open source ones which may or may not fit the bill for you.
If your build machine is also your deployment machine (i.e. your target OpenCL implementation is available on your build machine), you can of course also put together a very basic offline compiler yourself by simply wrapping clBuildProgram() and friends in a basic command line utility.
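A wrapper of that kind can be very small. The following sketch (my own, not a published tool) compiles a .cl file against the first available platform/device and prints the build log, so syntax errors show up at build time:

```c
/* Minimal "offline compiler": usage `./clcheck kernel.cl`.
 * Assumes the first platform/device found is the one you care about. */
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s kernel.cl\n", argv[0]);
        return 1;
    }

    /* Read the whole source file into memory. */
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }
    fseek(f, 0, SEEK_END);
    long len = ftell(f);
    rewind(f);
    char *src = malloc(len + 1);
    fread(src, 1, len, f);
    src[len] = '\0';
    fclose(f);

    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);

    cl_int err;
    cl_program prog = clCreateProgramWithSource(ctx, 1, (const char **)&src,
                                                NULL, &err);
    err = clBuildProgram(prog, 1, &device, "", NULL, NULL);

    /* Print the build log either way: warnings on success, errors on failure. */
    size_t log_size;
    clGetProgramBuildInfo(prog, device, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);
    char *log = malloc(log_size);
    clGetProgramBuildInfo(prog, device, CL_PROGRAM_BUILD_LOG, log_size, log, NULL);
    fputs(log, stderr);

    return err == CL_SUCCESS ? 0 : 1;
}
```

Note that this only checks the code against whatever OpenCL implementation happens to be installed on the build machine, so a kernel that passes here can still fail on a different vendor's compiler.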

Getting detailed information about compiled OpenCL kernel on NVidia

Is there a way to get detailed information about how an OpenCL kernel was compiled on NVidia platforms (or on other platforms)? Either external tools, or tests that can be put into the kernel. Specifically:
Did vectorization succeed, and how did the work items get grouped into warps?
If work items inside a work group go into different branches, did the compiler optimize it so that they still execute in parallel?
Did private memory variables get mapped to registers in the multiprocessor, or were they put into local/global memory? (Some architectures have more private memory per work group than local memory)
Can this information be seen in the PTX assembly output, or is this still higher level?
This is all compiler-level metadata; some of it is available through the generic OpenCL API, but the details you ask about are far too low-level. They might be available through some NVidia OpenCL extension, though I'm not familiar with those. Probably your best bet is finding some tools that work at the PTX level and feeding them the OpenCL program binaries.
You can always just generate the PTX assembly and look into it (this uses the cl.hpp C++ wrapper):
program.build("-cl-fast-relaxed-math");
// On NVidia, CL_PROGRAM_BINARIES returns PTX text rather than machine code
cout << program.getInfo<CL_PROGRAM_BINARIES>()[0] << endl;
In the PTX you can see exactly how the compiler translated the OpenCL code. The PTX ISA documentation is available from NVidia.
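With the plain C API (rather than the cl.hpp wrapper used in the snippet above), a PTX dump looks roughly like this sketch; `dump_ptx` is a made-up helper name, and the program is assumed to be already built for a single device:

```c
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

/* Dump the program binary for a single-device program; on NVidia the
 * "binary" returned by CL_PROGRAM_BINARIES is PTX, i.e. readable text. */
void dump_ptx(cl_program prog)
{
    size_t size;
    clGetProgramInfo(prog, CL_PROGRAM_BINARY_SIZES, sizeof(size), &size, NULL);

    unsigned char *binary = malloc(size);
    unsigned char *binaries[1] = { binary };
    clGetProgramInfo(prog, CL_PROGRAM_BINARIES, sizeof(binaries), binaries, NULL);

    fwrite(binary, 1, size, stdout);
    free(binary);
}
```

Note that CL_PROGRAM_BINARIES expects an array of pre-allocated pointers, one per device, which is why the sizes are queried first.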

R Parallel Processing with Xeon Phi, minimal code changes?

Looking at buying a couple of Xeon Phi 5110Ps, but trying to estimate how much code I would have to change and what other software is needed.
Currently I make good use of R on a multi-core Windows machine (24 cores) via the foreach package, passing it other packages (forecast, glmnet, etc.) to do my parallel processing.
Having a Xeon Phi, I understand I would want to compile R with support for it:
https://software.intel.com/en-us/articles/running-r-with-support-for-intel-xeon-phi-coprocessors
And I understand this could be done with a trial version of Parallel Studio XE.
Do I then need to edit R's Makeconf file, adding the C/C++ flags for the Phi, and compile all the needed packages before the Parallel Studio trial expires? Or do I not need to edit the Makeconf to get the benefits of foreach on the Phi?
Seems like some of this will be handled automatically once R is compiled, with offloading done by the Math Kernel Library (MKL), but I'm not totally sure of this.
Somewhat related question: Is the Intel Xeon Phi usable without a costly Intel Compiler?
Also, revolutionanalytics.com seems to have a few related blog posts, but they are not entirely conclusive for me: http://blog.revolutionanalytics.com/2015/05/behold-the-power-of-parallel.html
If all you need is matrix operations, you can compile R with the MKL libraries per [Running R with Support for Intel® Xeon Phi™ Coprocessors][1], which requires the Intel Compiler. Microsoft R comes precompiled with MKL, but I was not able to use the automatic offload; I had to compile R with the Intel compiler for it to work properly.
You could use the trial version of the compiler and compile R during the trial period to see if it fits your purpose.
If you want to use things like the foreach package by setting up a cluster, I'm afraid you're out of luck, since each node is a Linux computer. On page 3 of [R-Admin][1] it says:
Cross-building is not possible: installing R builds a minimal version of R and then runs many R scripts to complete the build.
You would have to cross-compile from the Xeon host for the Xeon Phi nodes with the Intel compiler, and that's just not feasible.
The last way to utilize the Phi is to rewrite your code to call it directly. Rcpp provides an easy interface to C and C++ routines: if you find a C routine that runs well on the Phi, you can call it from your R code. I have done this with CUDA; Rcpp is a thin layer with good examples of how to use it, and if you combine those with examples of offloading to the Phi card, you can probably achieve your goal with less overhead.
But if all you need is matrix ops, there is no quicker route to supercomputing than a good double-precision NVidia card and preloading nvBLAS during R startup.

clBuildProgram crash on NVidia cards

I have an OpenCL application that runs fine when using an AMD GPU.
When using an NVidia card, the clBuildProgram call crashes the application (it does not even return a failure value; it just crashes). When debugging, the crash yields:
a read access violation in the nvopencl.dll module, code 0xc0000005. The debugger indicates the clGetExportTable function (inside nvopencl.dll) as the source of the violation.
By commenting random parts of the kernels, I have reached this point:
In the code fragment:
if (something){
    //some stuff
    float3 gradient = (float3)(0,1,0);
    gradient = normalize(gradient);
    return;
}
If I delete the "gradient = normalize(gradient);" line, clBuildProgram does not crash; leaving it there crashes the program. The gradient variable is not even used inside the kernel, so it is not related to any other part of it. And the normalize function by itself should not be the source of the problem, because it is used in other parts of the code.
I think it may be related to a driver bug: installing the latest CUDA version (6.5) makes the OpenCL Volume Rendering sample binaries distributed by NVidia misbehave, while with a CUDA 6 installation the Volume Rendering sample works properly.
My code uses volume rendering techniques, which is why I think it may be related, but my problem appears with both the CUDA 6.5 and CUDA 6 installations.
Have you experienced something similar? What could be the cause of the problem, and how can I handle it?
Thank you.
After further analysis, the problem seems to be a bug in the drivers, as Xapa mentioned.

Is there MPI stub library?

I have C source with MPI calls.
I wonder: can I get a sequential program from this source by linking it with some MPI stub library? Where can I get such a library?
Most correctly-written MPI programs should not depend on the number of processes they use to get a correct answer -- e.g., if you run them on one process (mpirun -np 1 ./a.out) they should still work. So you oughtn't need a stub library; just use MPI. (If for some reason you just don't want extraneous libraries kicking around, it's certainly possible to write stubs and link against them -- I did this back in the day, when setting up MPI on my laptop was a huge PITA -- and you could use that as a starting point and add any functionality you need. But these days, fiddling with a stub library is probably more work than just using an existing MPI implementation.)
If your MPI program does not currently work correctly on one processor, a stub library probably won't help; you'll need to find the special cases it's not handling and fix them.
I don't think this is possible. Contrary to OpenMP, programs using MPI don't necessarily run or produce the same result when you simply take away the MPI part.
PETSc contains a stub MPI library that works for one-process (i.e. serial) execution:
http://www.mcs.anl.gov/petsc/petsc-3.2/include/mpiuni/mpi.h