Valgrind equivalent for OpenCL

To check memory access violations on the CPU, there's Valgrind/memcheck; for CUDA code on the GPU, there's cuda-memcheck. Is there an equivalent to these tools for OpenCL?

There is now an OpenCL device simulator called Oclgrind that works in a similar manner to Valgrind to provide a 'memcheck' feature (among other things).
It's open-source, and binary builds are available for various platforms. Like Valgrind it's not fast, but using it is very straightforward:
$ oclgrind ./myapp
Invalid write of size 4 at global memory address 0x3000000000010
Kernel: write_out_of_bounds
Entity: Global(4,0,0) Local(4,0,0) Group(0,0,0)
store i32 %tmp15, i32 addrspace(1)* %tmp19, align 4, !dbg !24
At line 4 of input.cl:
c[i] = a[i] + b[i]

Have you looked at http://github.com/KhronosGroup/webcl-validator? It takes your OpenCL kernel source and instruments it with bounds checking code. OOB reads/writes are currently discarded, but you could modify the instrumented kernel (or the tool itself) to make it report the access violation.

Related

Is it possible to compile with -qopt-zmm-usage=high and set only one method to -qopt-zmm-usage=low, i.e. disable the ZMM registers in one loop?

I am using the Intel compiler to compile a class, e.g. MyClass.h and MyClass.cpp, with the following compiler flags:
-O3 -qopt-zmm-usage=high
If the Intel compiler's heuristics turn out to be wrong for one loop, and its performance is actually higher without vectorization, vectorization can be disabled by marking the loop with the #pragma novector pragma.
Is there an equivalent to enable only XMM/YMM instructions, i.e. to disable the ZMM registers, for a single loop?

Why does setting an initialization value prevent placing a variable on a GPU in TensorFlow?

I get an exception when I try to run the following very simple TensorFlow code, although I virtually copied it from the documentation:
import tensorflow as tf
with tf.device("/gpu:0"):
    x = tf.Variable(0, name="x")
sess = tf.Session()
sess.run(x.initializer)  # Bombs!
The exception is:
tensorflow.python.framework.errors.InvalidArgumentError: Cannot assign a device to
node 'x': Could not satisfy explicit device specification '/device:GPU:0' because
no supported kernel for GPU devices is available.
If I change the variable's initial value to tf.zeros([1]) instead, everything works fine:
import tensorflow as tf
with tf.device("/gpu:0"):
    x = tf.Variable(tf.zeros([1]), name="x")
sess = tf.Session()
sess.run(x.initializer)  # Works fine
Any idea what's going on?
This error arises because tf.Variable(0, ...) defines a variable of element type tf.int32, and there is no kernel that implements int32 variables on GPU in the standard TensorFlow distribution. When you use tf.Variable(tf.zeros([1])), you're defining a variable of element type tf.float32, which is supported on GPU.
The story of tf.int32 on GPUs in TensorFlow is a long one. While it's technically easy to support integer operations running on a GPU, our experience has been that most integer operations actually take place on the metadata of tensors, and this metadata lives on the CPU, so it's more efficient to operate on it there. As a short-term workaround, several kernel registrations for int32 on GPUs were removed. However, if these would be useful for your models, it would be possible to add them as custom ops.
Source: In TensorFlow 0.10, the Variable-related kernels are registered using the TF_CALL_GPU_NUMBER_TYPES() macro. The current "GPU number types" are tf.float16, tf.float32, and tf.float64.

OpenCL: maintaining separate version of kernels

The Intel SDK says:
If you need separate versions of kernels, one way to keep the source
code base same, is using the preprocessor to create CPU-specific or
GPU-specific optimized versions of the kernels. You can run
clBuildProgram twice on the same program object, once for CPU with
some flag (compiler input) indicating the CPU version, the second time
for GPU and corresponding compiler flags. Then, when you create two
kernels with clCreateKernel, the runtime has two different versions
for each kernel.
Let us say I use clBuildProgram twice, with flags for the CPU and the GPU. This compiles two versions of the program: one optimized for the CPU and another optimized for the GPU. But how will I create two kernels now, since there is no CPU/GPU-specific option in clCreateKernel()?
The sequence of calls to build the kernel for CPU and GPU devices and obtain the different kernels could look like this:
cl_program program = clCreateProgramWithSource(...)
clBuildProgram(program, numCpuDevices, cpuDeviceList, cpuOptions, NULL, NULL);
cl_kernel cpuKernel = clCreateKernel(program, ...);
clBuildProgram(program, numGpuDevices, gpuDeviceList, gpuOptions, NULL, NULL);
cl_kernel gpuKernel = clCreateKernel(program, ...);
(Note: I could not test this at the moment. If there's something wrong, I'll delete this answer)
clCreateKernel creates an entry point to a program, and the program has already been compiled for a specific device (CPU or GPU). So there is nothing you can do at the create-kernel level if the program is already compiled one way or the other.
By passing different compiled program objects, clCreateKernel will create different kernel objects for different devices.
The key to control the GPU/CPU mode is at the clBuildProgram step, where a device has to be specified.
Additionally the compilation can be further refined with external defines to disable/enable pieces of code specifically designed for CPU/GPU.
You would create only one kernel with the same name. To discriminate between devices you would use #ifdef queries inside the kernel, e.g.:
kernel void foo(global float *bar)
{
#ifdef HAVE_CPU
    bar[0] = 23.0f;
#elif defined(HAVE_GPU)
    bar[0] = 42.0f;
#endif
}
You set this flag by building with
program.build({device}, "-DHAVE_CPU")
or -DHAVE_GPU. Remark: the -D prefix is not a typo; it is the standard syntax for passing a preprocessor definition to the OpenCL compiler.

What are kernel blocks in OpenCL?

In the article "How to set up Xcode to run OpenCL code, and how to verify the kernels before building" NeXTCoder referred to some code as the "Short Answer", i.e. https://developer.apple.com/library/mac/#documentation/Performance/Conceptual/OpenCL_MacProgGuide/XCodeHelloWorld/XCodeHelloWorld.html.
In that code the author says "Wrap your kernel code into a kernel block:" without explaining what is a "kernel block". (The OpenCL Programmer Guide for Mac OS X by Apple makes no mention of kernel block.)
The host program calls "square_kernel" but the sample kernel is called "square", and the sample kernel block is labelled "kernelName" (in italics). Can you please tell me how to put the 3 pieces together: kernel, kernel block & host program, to run in Xcode 5.1? I only have one kernel. Thanks.
It's not really jargon; it's a closure-like entity.
OpenCL C 2.0 adds support for the clang block syntax. You use the ^ operator to declare a Block variable and to indicate the beginning of a Block literal. The body of the Block itself is contained within {}, as shown in the example (as usual with C, ; indicates the end of the statement). The Block is able to make use of variables from the same scope in which it was defined.
Example:
int multiplier = 7;
int (^myBlock)(int) = ^(int num) {
    return num * multiplier;
};
printf("%d\n", myBlock(3));
// prints 21
Source:
https://www.khronos.org/registry/cl/sdk/2.1/docs/man/xhtml/blocks.html
The term "kernel block" only seems to be a jargon to refer to the "part of the code that is the kernel". Particularly, the kernel block in this case is simply the function that is declared to be a kernel, by adding kernel before its declaration. Or, even simpler, and from the way how the term is used on this website, I would say that "kernel block" is the same as "kernel".
The kernelName (in italics) is a placeholder. The code there shows the general pattern of how to define any kernel:
It is prefixed with kernel
It returns void
It has a name ... the kernelName, which may for example be square
It has several input- and output parameters
The reason why the kernel is called square, but invoked with square_kernel seems to be some magic that is done by XCode: It seems to read the .cl file, and creates a .h file that contains additional declarations that are derived from the .cl file (as can be seen in this question, where a kernel called rebound is defined, and GCL generated a rebound_kernel declaration).

MPI - one function for MPI_Init and MPI_Init_thread

Is it possible to have one function to wrap both MPI_Init and MPI_Init_thread? The purpose of this is to have a cleaner API while maintaining backward compatibility. What happens to a call to MPI_Init_thread when it is not supported by the MPI run time? How do I keep my wrapper function working for MPI implementations when MPI_Init_thread is not supported?
MPI_INIT_THREAD is part of the MPI-2.0 specification, which was released 15 years ago. Virtually all existing MPI implementations are MPI-2 compliant except for some really archaic ones. You might not get the desired level of thread support, but the function should be there and you should still be able to call it instead of MPI_INIT.
Your best and most portable option is to have a configure-like mechanism probe for MPI_Init_thread in the MPI library, e.g. by trying to compile a very simple MPI program and seeing if it fails with an unresolved symbol reference; alternatively, you can directly examine the export table of the MPI library with nm (for archives) or objdump (for shared ELF objects). Once you've determined that the MPI library has MPI_Init_thread, you can define a preprocessor symbol, e.g. CONFIG_HAS_INITTHREAD. Then have your wrapper look similar to this one:
int init_mpi(int *pargc, char ***pargv, int desired, int *provided)
{
#if defined(CONFIG_HAS_INITTHREAD)
    return MPI_Init_thread(pargc, pargv, desired, provided);
#else
    *provided = MPI_THREAD_SINGLE;
    return MPI_Init(pargc, pargv);
#endif
}
Of course, if the MPI library is missing MPI_INIT_THREAD, then MPI_THREAD_SINGLE and the other thread support level constants will also not be defined in mpi.h, so you might need to define them somewhere.
