Is it possible to compile with -qopt-zmm-usage=high and set only one method to -qopt-zmm-usage=low? Disable ZMM registers in one loop - Intel

I am using the Intel compiler to compile a class, e.g. MyClass.h and MyClass.cpp, with the following compiler flags:
-O3 -qopt-zmm-usage=high
If the Intel compiler's heuristics turn out to be wrong for one loop, and the loop is actually faster without vectorization, then vectorization can be disabled by marking the loop with #pragma novector.
Is there an equivalent that enables only XMM/YMM instructions, i.e. that disables the ZMM registers?
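For reference, a minimal sketch of the per-loop pragma mentioned above, written as a plain C function. The name scalar_sum is made up for illustration, and the macro guard is an assumption added so that non-Intel compilers do not warn about an unknown pragma. It only shows how vectorization is switched off for one loop; it does not restrict the loop to XMM/YMM.

```c
#include <stddef.h>

/* Sums n doubles. The loop is marked so that the Intel compiler
   leaves it scalar instead of vectorizing it. */
double scalar_sum(const double *x, size_t n)
{
    double total = 0.0;
    /* Assumption: the classic and the LLVM-based Intel compilers define
       one of these macros; other compilers skip the pragma entirely. */
#if defined(__INTEL_COMPILER) || defined(__INTEL_LLVM_COMPILER)
#pragma novector
#endif
    for (size_t i = 0; i < n; ++i) {
        total += x[i];
    }
    return total;
}
```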

Related

OpenCL: maintaining separate version of kernels

The Intel SDK says:
If you need separate versions of kernels, one way to keep the source
code base same, is using the preprocessor to create CPU-specific or
GPU-specific optimized versions of the kernels. You can run
clBuildProgram twice on the same program object, once for CPU with
some flag (compiler input) indicating the CPU version, the second time
for GPU and corresponding compiler flags. Then, when you create two
kernels with clCreateKernel, the runtime has two different versions
for each kernel.
Let us say I call clBuildProgram twice, once with flags for the CPU and once with flags for the GPU. This will compile two versions of the program, one optimized for the CPU and another optimized for the GPU. But how do I create the two kernels, since there is no CPU/GPU-specific option in clCreateKernel()?
The sequence of calls to build the kernel for CPU- and GPU devices and obtain the different kernels could look like this:
cl_program program = clCreateProgramWithSource(...);
clBuildProgram(program, numCpuDevices, cpuDeviceList, cpuOptions, NULL, NULL);
cl_kernel cpuKernel = clCreateKernel(program, ...);
clBuildProgram(program, numGpuDevices, gpuDeviceList, gpuOptions, NULL, NULL);
cl_kernel gpuKernel = clCreateKernel(program, ...);
(Note: I could not test this at the moment. If there's something wrong, I'll delete this answer)
clCreateKernel creates an entry point to a program, and the program has already been compiled for a specific device (CPU or GPU). So there is nothing you can do at the kernel creation level once the program has been compiled one way or the other.
By passing different compiled program objects, clCreateKernel will create different kernel objects for different devices.
The key to control the GPU/CPU mode is at the clBuildProgram step, where a device has to be specified.
Additionally the compilation can be further refined with external defines to disable/enable pieces of code specifically designed for CPU/GPU.
You would create only one kernel with the same name. To discriminate between devices you would use #ifdef checks inside the kernel, i.e.:
kernel void foo(global float *bar)
{
#ifdef HAVE_CPU
    bar[0] = 23.0;
#elif defined(HAVE_GPU)
    bar[0] = 42.0;
#endif
}
You can obtain this flag by
program.build({device}, "-DHAVE_CPU")
or -DHAVE_GPU. Remark: -D... is not a typo.
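With the plain C API the same idea might look like the sketch below. The enum device_kind and the function build_options_for are hypothetical names used only for illustration; real host code would decide based on the device type it queried from the runtime, and would pass the returned string as the options argument of clBuildProgram.

```c
/* Hypothetical device categories for this sketch. Real host code would
   derive this from the device type reported by the OpenCL runtime. */
enum device_kind { DEVICE_CPU, DEVICE_GPU };

/* Returns the preprocessor define that selects the matching
   #ifdef branch inside the kernel source. */
const char *build_options_for(enum device_kind kind)
{
    return kind == DEVICE_GPU ? "-DHAVE_GPU" : "-DHAVE_CPU";
}

/* Intended use (not compiled here, since it needs the OpenCL headers):
   clBuildProgram(program, 1, &device, build_options_for(kind), NULL, NULL); */
```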

What are kernel blocks in OpenCL?

In the article "How to set up Xcode to run OpenCL code, and how to verify the kernels before building" NeXTCoder referred to some code as the "Short Answer", i.e. https://developer.apple.com/library/mac/#documentation/Performance/Conceptual/OpenCL_MacProgGuide/XCodeHelloWorld/XCodeHelloWorld.html.
In that code the author says "Wrap your kernel code into a kernel block:" without explaining what is a "kernel block". (The OpenCL Programmer Guide for Mac OS X by Apple makes no mention of kernel block.)
The host program calls "square_kernel" but the sample kernel is called "square" and the sample kernel block is labelled "kernelName" (in italics). Can you please tell me how to put the 3 pieces together (kernel, kernel block and host program) to run in Xcode 5.1? I only have one kernel. Thanks.
It's not really jargon. It's a closure-like entity.
OpenCL C 2.0 adds support for the clang block syntax. You use the ^ operator to declare a Block variable and to indicate the beginning of a Block literal. The body of the Block itself is contained within {}, as shown in the example (as usual with C, ; indicates the end of the statement). The Block is able to make use of variables from the same scope in which it was defined.
Example:
int multiplier = 7;
int (^myBlock)(int) = ^(int num) {
return num * multiplier;
};
printf("%d\n", myBlock(3));
// prints 21
Source:
https://www.khronos.org/registry/cl/sdk/2.1/docs/man/xhtml/blocks.html
The term "kernel block" seems to be just jargon for the "part of the code that is the kernel". In this case, the kernel block is simply the function that is declared to be a kernel by adding kernel before its declaration. Or, even simpler, and judging from how the term is used on this website, I would say that "kernel block" is the same as "kernel".
The kernelName (in italics) is a placeholder. The code there shows the general pattern of how to define any kernel:
It is prefixed with kernel
It returns void
It has a name ... the kernelName, which may for example be square
It has several input- and output parameters
The reason why the kernel is called square but invoked as square_kernel seems to be some magic done by Xcode: it appears to read the .cl file and create a .h file containing additional declarations derived from the .cl file (as can be seen in this question, where a kernel called rebound is defined and GCL generated a rebound_kernel declaration).

Fortran symbol not in load table (unable to call loaded symbol in R)

I am trying to build a Fortran DLL with Absoft Pro Fortran 13.0.3, 64 bits, for use within R, on Windows 7 64 bits.
Here is my file mycalc.f (it's a dumb example, just to test functionality):
subroutine mycalcf(a,b,c)
real*8 a,b,c
dll_export mycalcf
c=a+b*b
end
The statement dll_export is not standard, but is found in some Fortran compilers (AFAIK it's also found in Lahey and CVF and Intel Fortran has a compiler directive instead). It just tells the compiler which symbols are to be exported.
I compile successfully with:
af90 -m64 -dll -YDLL_NAMES=LCS mycalc.f -o mycalc.dll
The option -YDLL_NAMES=LCS tells the compiler to build a library with lowercase symbols, which seems better for R.
If I run dumpbin /exports mycalc.dll, I can find mycalcf in exported symbols, in lowercase, without any underscore before or after.
Now, from R (64 bit version), the following works:
dyn.load("mycalc.dll")
is.loaded("mycalcf")
.Fortran("mycalcf", a=4, b=5, c=0)
I get c=29 on return, as expected.
BUT, if I restart R, the following does not work (notice I only removed the is.loaded test):
dyn.load("mycalc.dll")
.Fortran("mycalcf", a=4, b=5, c=0)
I get the error: Fortran symbol name "mycalcf" not in load table.
Now my question is: why is this test so important?
For comparison, when I try the same with gfortran instead of Absoft, I have no problem at all. I compile with: gfortran -m64 -shared -o mycalc2.dll mycalc.f (after commenting out the dll_export statement, which is not needed, nor even recognized, by gfortran).
Then in R:
dyn.load("mycalc2.dll")
.Fortran("mycalcf", a=4, b=5, c=0)
And I get c=29, no error.
Now, I suspect there is something the gcc linker does that is not done automatically by the Absoft linker (actually it's Microsoft's link.exe). But I have no hint what it can be.
Any idea is welcome!
Ok, solution after a good question from Vladimir F (see comments).
Actually, one must append underscores to symbol names. Since there is no way to do it by compiler option, one needs CDEC$ directives (see HP or Intel documentation).
Here it's simply:
subroutine mycalcf(a,b,c)
CDEC$ attributes alias:'mycalcf_' :: mycalcf
real*8 a,b,c
dll_export mycalcf
c=a+b*b
end
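To make the naming convention concrete, here is a C stand-in for the same routine. It is only a sketch, under the assumption that the caller expects gfortran-style names (lowercase with one trailing underscore); it is not how the Absoft build works. Fortran passes arguments by reference, hence the pointers.

```c
/* C stand-in for the Fortran subroutine mycalcf(a,b,c).
   The trailing underscore mimics the symbol name gfortran emits by
   default (an assumption about the caller's convention).
   real*8 arguments map to double and are passed by reference. */
void mycalcf_(double *a, double *b, double *c)
{
    *c = *a + (*b) * (*b);
}
```

Called with a=4 and b=5, this stores 29 in c, matching the result from the Fortran version.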
Second solution, from Absoft forum: actually I was wrong from the very beginning. Contrary to what I thought, one need not use the dll_export statement, and it even introduced the problem: without it, the compiler appends underscores. All symbols are exported by default, as in gfortran. So the correct code is simply:
subroutine mycalcf(a,b,c)
real*8 a,b,c
c=a+b*b
end
There is not even any need for an option to get lowercase symbols; that is also the default.
However, one question remains: does the R function .Fortran always add underscore (is there a way to tell it not to?), and if it's always added, why does the call work when is.loaded is called beforehand? R seems to be doing something weird here.
I tried to track is.loaded in R source code (up to do_isloaded in src\main\dotcode.c), to no avail.

MPI - one function for MPI_Init and MPI_Init_thread

Is it possible to have one function to wrap both MPI_Init and MPI_Init_thread? The purpose of this is to have a cleaner API while maintaining backward compatibility. What happens to a call to MPI_Init_thread when it is not supported by the MPI run time? How do I keep my wrapper function working for MPI implementations when MPI_Init_thread is not supported?
MPI_INIT_THREAD is part of the MPI-2.0 specification, which was released 15 years ago. Virtually all existing MPI implementations are MPI-2 compliant except for some really archaic ones. You might not get the desired level of thread support, but the function should be there and you should still be able to call it instead of MPI_INIT.
Your best and most portable option is to have a configure-like mechanism probe for MPI_Init_thread in the MPI library, e.g. by trying to compile a very simple MPI program and seeing whether it fails with an unresolved symbol reference. Alternatively, you can directly examine the export table of the MPI library with nm (for archives) or objdump (for shared ELF objects). Once you've determined that the MPI library has MPI_Init_thread, you can have a preprocessor symbol defined, e.g. CONFIG_HAS_INITTHREAD. Then write your wrapper similar to this one:
int init_mpi(int *pargc, char ***pargv, int desired, int *provided)
{
#if defined(CONFIG_HAS_INITTHREAD)
return MPI_Init_thread(pargc, pargv, desired, provided);
#else
*provided = MPI_THREAD_SINGLE;
return MPI_Init(pargc, pargv);
#endif
}
Of course, if the MPI library is missing MPI_INIT_THREAD, then MPI_THREAD_SINGLE and the other thread support level constants will also not be defined in mpi.h, so you might need to define them somewhere.
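A possible shape for those fallback definitions, together with a small check of the returned support level, is sketched below. The numeric values are placeholders chosen only to keep the ordering the MPI standard requires (SINGLE < FUNNELED < SERIALIZED < MULTIPLE), and thread_support_ok is a hypothetical helper name, not part of MPI.

```c
/* Fallback constants for an MPI library that lacks MPI_Init_thread.
   CONFIG_HAS_INITTHREAD is the configure-time symbol used by the
   wrapper; when it is defined, mpi.h already supplies these names.
   Assumption: the values are arbitrary placeholders, only their
   ordering matters. */
#if !defined(CONFIG_HAS_INITTHREAD)
enum {
    MPI_THREAD_SINGLE = 0,
    MPI_THREAD_FUNNELED = 1,
    MPI_THREAD_SERIALIZED = 2,
    MPI_THREAD_MULTIPLE = 3
};
#endif

/* Hypothetical helper: nonzero when the level reported in 'provided'
   is at least the level that was requested. */
int thread_support_ok(int desired, int provided)
{
    return provided >= desired;
}
```

After calling init_mpi, the caller can pass the desired and provided levels to such a helper and fall back to single-threaded communication when it returns zero.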

How to add a macro with clBuildProgram for OpenCL

In my kernel I have this defined:
#define ACTIVATION_FUNCTION(X) (1.7159f*tanh(2.0f/3.0f*X))
I would like to define it in the clBuildProgram call instead, so that I can alter the kernel at runtime. How can I do this?
You can use the -D argument to the OpenCL compiler by passing it in the options parameter of the clBuildProgram function. Passing -D x=y is equivalent to adding #define x y at the top of your kernel file. Similarly, passing -D x is equivalent to adding #define x (for any x and y, of course).
In your case, you probably want to pass something like this:
-D ACTIVATION_FUNCTION(X)=(1.7159f*tanh(2.0f/3.0f*X))
Which you can then change as you see fit, directly from your program, at runtime.
Note that, as an alternative, you could also open the kernel file and write the define directly into it, but passing -D is probably the cleanest way. Just be careful with newlines; I'm not sure how well they are handled.
Ref. this documentation page on clBuildProgram, "Preprocessor Options" section.
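On the host side, the options string can be assembled at runtime before the build call. The sketch below uses only the C standard library; format_define_option is a hypothetical helper name. Whether a given OpenCL implementation accepts a function-like macro through -D, and how it tokenizes definitions containing spaces, are assumptions worth checking, so the definition here is kept free of whitespace.

```c
#include <stdio.h>
#include <string.h>

/* Writes "-D name=definition" into out.
   Returns 0 on success, -1 if a piece is empty or the buffer is too small. */
int format_define_option(char *out, size_t out_size,
                         const char *name, const char *definition)
{
    if (strlen(name) == 0 || strlen(definition) == 0) {
        return -1;
    }
    int written = snprintf(out, out_size, "-D %s=%s", name, definition);
    /* snprintf reports the length it needed; anything that does not fit
       (including the terminating NUL) counts as a failure. */
    return (written >= 0 && (size_t)written < out_size) ? 0 : -1;
}
```

The resulting buffer would then be passed as the options argument of clBuildProgram, and a different definition string can be supplied on each rebuild.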
