How do you use double in OpenCL on a Mac Pro?

I have a Mac Pro (Late 2013) and I want to do some math in double using OpenCL. When I was using Mavericks, CL_DEVICE_EXTENSIONS for my FirePro GPU only listed cl_APPLE_fp64_basic_ops, so I couldn't use double math functions like exp(). I recently upgraded to Yosemite and the proper cl_khr_fp64 now appears in the list of extensions, but I still can't use exp on a double. The error log shows that the compiler is looking for an overloaded function: exp is available for float, float4, float8, ..., but not for any 64-bit type. I have included the pragma to turn on fp64:
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
Does anyone know what's going on? Why does the GPU say that cl_khr_fp64 is available if I can't use all of it? I can +, -, *, / in double, but I could already do that with just the basic ops. Is Apple lying to me about having upgraded fp64 support?
Strangely, OpenCL reports cl_khr_fp64 for my CPU as well, but I can't use exp on the CPU either.

In OpenCL C, you should call them double, not cl_khr_fp64,
e.g.
double pie = M_PI;
double2 two_pies = (double2)(M_PI); // broadcasts M_PI to both lanes, same as (double2)(M_PI, M_PI)
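For reference, here is a minimal kernel sketch (the kernel and argument names are made up for illustration) that should build on any device whose cl_khr_fp64 support is complete, since it needs the double overload of exp:
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
__kernel void exp_test(__global double *out)
{
    size_t i = get_global_id(0);
    double x = (double)i * 0.001;
    out[i] = exp(x); // requires the double overload of exp()
}
If even this fails with the same overload error, the runtime is advertising the extension without actually providing the double math built-ins.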

Related

CL/cl.h not found in SYCL

I have just started working with SYCL. I ran computecpp_info on my system and it reported the following data for 3 devices:
ComputeCpp Info (CE 1.1.0)
SYCL 1.2.1 revision 3
Device 1 ( GeForce GTX 1050 = NO - Device does not support SPIR)
Device 2 (Intel(R) HD Graphics 630 = UNTESTED - Device not tested on this OS)
Device 3 (Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz = UNTESTED - Device running untested driver)
Now my question is: can I work on these devices, given that 2 are untested and 1 is not supported? Or am I missing some drivers?
I also implemented a simple example, but it gives me a CL/cl.h not found error:
#include <CL/sycl.hpp>
#include <array>
#include <numeric>
#include <iostream>
int main() {
  const size_t array_size = 1024 * 512;
  std::array<cl::sycl::cl_int, array_size> in, out;
  std::iota(begin(in), end(in), 0);
  cl::sycl::queue device_queue;
  cl::sycl::range<1> n_items{ array_size };
  cl::sycl::buffer<cl::sycl::cl_int, 1> in_buffer(in.data(), n_items);
  cl::sycl::buffer<cl::sycl::cl_int, 1> out_buffer(out.data(), n_items);
  device_queue.submit([&](cl::sycl::handler &cgh) {
    constexpr auto sycl_read = cl::sycl::access::mode::read;
    constexpr auto sycl_write = cl::sycl::access::mode::write;
    auto in_accessor = in_buffer.get_access<sycl_read>(cgh);
    auto out_accessor = out_buffer.get_access<sycl_write>(cgh);
    cgh.parallel_for<class VecScalMul>(n_items,
        [=](cl::sycl::id<1> wiID) {
          out_accessor[wiID] = in_accessor[wiID] * 2;
        });
  });
}
The computecpp_info tool shows the devices that are or are not supported by ComputeCpp on your system. Here's an explanation of the outputs:
NO - Device does not support SPIR: This means that the device can be seen but it does not support SPIR instructions and so cannot be supported by ComputeCpp
UNTESTED - Device not tested on this OS: This means that the device can be seen and is reporting that it supports SPIR instructions. It should work with ComputeCpp but this specific device has not been tested by the ComputeCpp team
The cl.h header missing error is because you are missing the OpenCL headers. These can be found here and you'll need to point at them when you compile your code. I'd suggest using the Getting Started guide with the sample code and then modifying the hello world example to test out your code. This has an existing CMake file that is designed to search for all the dependencies you need.
(This is a ComputeCpp-specific question rather than a SYCL one.) The UNTESTED platforms will probably work, but Codeplay cannot guarantee it. In my experience both should work, though you may hit OpenCL driver bugs on the Intel GPU depending on your configuration.
You need the OpenCL headers on your system, since the SYCL 1.2.1 specification is built on top of OpenCL.
Disclaimer: I am a Codeplay employee working on ComputeCpp!

OpenCL double precision not working

So I have a project that I created on a Mac Pro using the double data type and it works perfectly. Now I have moved the project to a MacBook Air and it started giving me
Exception
ERROR: clBuildProgram(-11)
error. The reason for this seems to be that my MacBook Air does not support double precision with OpenCL. However, I then found this: OpenCL kernel error on Mac OSx. I applied the method from the answer like this:
cl_device_fp_config cfg;
clGetDeviceInfo(devicesIds[0], CL_DEVICE_DOUBLE_FP_CONFIG, sizeof(cfg), &cfg, NULL); // 0 is for the device number I guess?
printf("Double FP config = %llu\n", cfg);
And he explained that if the result is 0, then it means that double precision is not supported. But I get this result:
Double FP config = 63
I also tried this method:
if (!cfg) {
    printf("Double precision not supported \n\n");
} else {
    printf("Following double precision features supported:\n");
    if (cfg & CL_FP_INF_NAN)
        printf(" INF and NaN values\n");
    if (cfg & CL_FP_DENORM)
        printf(" Denormalized numbers\n");
    if (cfg & CL_FP_ROUND_TO_NEAREST)
        printf(" Round To Nearest Even mode\n");
    if (cfg & CL_FP_ROUND_TO_INF)
        printf(" Round To Infinity mode\n");
    if (cfg & CL_FP_ROUND_TO_ZERO)
        printf(" Round To Zero mode\n");
    if (cfg & CL_FP_FMA)
        printf(" Floating-point multiply-and-add operation\n\n");
}
And I got the following results:
Double FP config = 63
Following double precision features supported:
INF and NaN values
Denormalized numbers
Round To Nearest Even mode
Round To Infinity mode
Round To Zero mode
Floating-point multiply-and-add operation
What is going on here? Does my system support double precision with OpenCL or not? If yes, how do I enable and use it? If not, what are my alternatives?
Now I am very confused. I don't know whether my MacBook Air supports double precision or not: the build failure suggests it doesn't, but the output above suggests it does.
If it doesn't support double precision, what should I do? Should I change everything in my project to float? Or, if it does support it, how do I enable it? I have followed a lot of tutorials and examples, e.g. https://streamcomputing.eu/blog/2013-10-17/writing-opencl-code-single-double-precision/, but none of them seem to work.
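One way to settle the "does my device support it" part is to query the extension string of the specific device you build for (the device you query with clGetDeviceInfo is not necessarily the device the program is being built for). A minimal host-side sketch, with error handling omitted and a made-up function name:
#include <stdio.h>
#include <string.h>
#ifdef __APPLE__
#include <OpenCL/cl.h>
#else
#include <CL/cl.h>
#endif

/* Returns 1 if the given device advertises full double support. */
static int device_has_fp64(cl_device_id device)
{
    char extensions[4096] = {0};
    clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS,
                    sizeof(extensions), extensions, NULL);
    return strstr(extensions, "cl_khr_fp64") != NULL;
}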
EDIT:
Unfortunately, double support is optional in OpenCL; some devices support it and some don't.
If a device supports double, then the device extensions (CL_DEVICE_EXTENSIONS) will contain cl_khr_fp64 (or a vendor variant such as cl_amd_fp64).
There is some example kernel code here for using floats when doubles aren't available.
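The usual pattern (a sketch of the idea, not the linked code) is to key the kernel source on the cl_khr_fp64 feature macro, so the same kernel falls back to single precision where doubles are unavailable:
#if defined(cl_khr_fp64)
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
typedef double real_t;
#else
typedef float real_t; /* fall back to single precision */
#endif

__kernel void scale(__global real_t *data, real_t factor)
{
    size_t i = get_global_id(0);
    data[i] = data[i] * factor;
}
The host then has to pass a float or a double for factor depending on which branch the device compiled, so in practice the host mirrors the same CL_DEVICE_EXTENSIONS check.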

32-bit pointer overflow in 64-bit gcc code - fails in compile

I am compiling a very large legacy Fortran 90 code (screamer) with gfortran on a Mac (2.2 GHz Intel Core i7) running Yosemite (gfortran 5.1.0). I have 16 GB of RAM. The code is memory intensive and I am trying to increase array sizes to solve larger problems. I have maintained the code for more than 10 years, and rewriting 200,000 lines of code right now is not an option. As I carefully increase the size of the 2-D matrix (am(max_nodes, max_nodes)) and several 1-D vectors (RHS(max_nodes) and a(max_nodes*2)) by varying the integer max_nodes, I eventually hit a 32-bit pointer limit (a 4-byte unsigned integer limit) while building, at the link step. See below.
final section layout:
__TEXT/__text addr=0x100001390, size=0x0006B9CB, fileOffset=0x00001390, type=1
__TEXT/__text_startup addr=0x10006CD60, size=0x00000041, fileOffset=0x0006CD60, type=1
__TEXT/__text_exit addr=0x10006CDB0, size=0x00000031, fileOffset=0x0006CDB0, type=1
__TEXT/__stubs addr=0x10006CDE2, size=0x00000252, fileOffset=0x0006CDE2, type=28
__TEXT/__stub_helper addr=0x10006D034, size=0x000003EE, fileOffset=0x0006D034, type=32
__TEXT/__cstring addr=0x10006D428, size=0x0000CFCB, fileOffset=0x0006D428, type=13
__TEXT/__const addr=0x10007A400, size=0x00008F00, fileOffset=0x0007A400, type=0
__TEXT/__eh_frame addr=0x100083300, size=0x0000DCF8, fileOffset=0x00083300, type=19
__DATA/__got addr=0x100091000, size=0x00000060, fileOffset=0x00091000, type=29
__DATA/__nl_symbol_ptr addr=0x100091060, size=0x00000010, fileOffset=0x00091060, type=29
__DATA/__la_symbol_ptr addr=0x100091070, size=0x00000318, fileOffset=0x00091070, type=27
__DATA/__mod_init_func addr=0x100091388, size=0x00000010, fileOffset=0x00091388, type=33
__DATA/__mod_term_func addr=0x100091398, size=0x00000008, fileOffset=0x00091398, type=34
__DATA/__const addr=0x1000913A0, size=0x000007C8, fileOffset=0x000913A0, type=0
__DATA/__static_data addr=0x100091B68, size=0x00000003, fileOffset=0x00091B68, type=0
__DATA/__data addr=0x100091B80, size=0x000003E0, fileOffset=0x00091B80, type=0
__DATA/__bss4 addr=0x100091F60, size=0x00000018, fileOffset=0x00000000, type=25
__DATA/__bss5 addr=0x100091F80, size=0x00020000, fileOffset=0x00000000, type=25
__DATA/__bss3 addr=0x1000B1F80, size=0x00000028, fileOffset=0x00000000, type=25
__DATA/__pu_bss2 addr=0x1000B1FA8, size=0x00000008, fileOffset=0x00000000, type=25
__DATA/__bss2 addr=0x1000B1FB0, size=0x00000024, fileOffset=0x00000000, type=25
__DATA/__pu_bss5 addr=0x1000B1FE0, size=0x0000024C, fileOffset=0x00000000, type=25
__DATA/__pu_bss4 addr=0x1000B2230, size=0x00000018, fileOffset=0x00000000, type=25
__DATA/__common addr=0x1000B2260, size=0x000020D8, fileOffset=0x00000000, type=25
__DATA/__zo_bss3 addr=0x1000B4338, size=0x00000021, fileOffset=0x00000000, type=25
__DATA/__huge addr=0x1000B4360, size=0x984EB80C, fileOffset=0x00000000, type=25
ld: 32-bit RIP relative reference out of range (2147639505 max is +/-4GB): from _main_loop_ (0x10000E120) to _a.4206 (0x180034380) in '_main_loop_' from screamer64.a(main_loop.o) for architecture x86_64
collect2: error: ld returned 1 exit status
In this error message main_loop is the core solver subroutine in screamer that populates and solves the large matrices. In this subroutine the large real*8 matrix and real*8 vectors are defined.
This RIP (instruction-pointer-relative) addressing error is noted many times on the web, but so far the available information has not helped me solve my problem. Note: the signed 4-byte integer limit is 2,147,483,647, so the error seems to be directly related to the use of a 32-bit offset.
The gfortran compiler options include -mcmodel=medium, which should allow data beyond the 2 GB limit to be addressed. -m64 has no effect. The total memory used by the primary matrix and vectors when the limit is reached is greater than 2.4 GB. The confusing thing is that the code is fully 64-bit, so I was not expecting 32-bit offsets. See below for the 64-bit check.
rbspielman$ file screamer64
screamer64: Mach-O 64-bit executable x86_64
The primary matrix and vectors are all real*8 (64-bit). All large arrays are declared directly in this one subroutine and are not placed in common.
All other variables in common are ordered by size. real*8, real, int, char.
Simple test programs demonstrate that there is no fundamental memory limit. I can easily define static arrays to > 10 GB without a problem. Larger arrays also work but end up using virtual memory and slow down as expected.
Clearly there is some sort of memory or pointer size limit but I just cannot figure it out. The code matrix solvers are massive and more realistic test programs would be tedious.
(I also compile screamer in Ubuntu LINUX without a problem up to the same array limit as the Mac. Compilations in Windows 8 fail at the usual 2 GB memory limit NOT at the pointer limit.)
Suggestions would be appreciated.
I just ran into the same problem with GNU Fortran (GCC) 5.1.0 in a Mac running 10.11.5 but the solution offered by the OP did not work for me.
However, I did find a solution: after systematically pruning my rather pedestrian legacy code, I found that every array has to be explicitly initialized with something up front; it is not enough to start filling it later in the code. I know it sounds silly, but once I initialized every "real" array (32-bit, it is legacy code) with 0.0 before doing any I/O or other work, it linked without complaint.
And, yes, as with the OP, my code worked until I changed the size of an array.
The reason why this worked may be in the contents of this bug report: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63793 but I am not good enough to tell you how to come up with a better workaround. My only guess is that initializing every array at the beginning favors GOT-relative addressing instead of RIP-relative addressing. (When will this be fixed? I just don't know how to push this up the line, and the bug report is dated 2014-11-09.)

Why doesn't the Nvidia OpenCL compiler (nvcc) use the registers twice?

I'm doing a small OpenCL benchmark using Nvidia drivers.
My kernel performs 1024 fused multiply-adds and stores the result in an array:
#define FLOPS_MACRO_1(x) { (x) = (x) * 0.99f + 10.f; } // Multiply-add
#define FLOPS_MACRO_2(x) { FLOPS_MACRO_1(x) FLOPS_MACRO_1(x) }
#define FLOPS_MACRO_4(x) { FLOPS_MACRO_2(x) FLOPS_MACRO_2(x) }
#define FLOPS_MACRO_8(x) { FLOPS_MACRO_4(x) FLOPS_MACRO_4(x) }
// more recursive macros ...
#define FLOPS_MACRO_1024(x) { FLOPS_MACRO_512(x) FLOPS_MACRO_512(x) }
__kernel void ocl_Kernel_FLOPS(int iNbElts, __global float *pf)
{
    for (unsigned i = get_global_id(0); i < iNbElts; i += get_global_size(0))
    {
        float f = (float) i;
        FLOPS_MACRO_1024(f)
        pf[i] = f;
    }
}
But when I look in the PTX generated, I see this:
.entry ocl_Kernel_FLOPS(
.param .u32 ocl_Kernel_FLOPS_param_0,
.param .u32 .ptr .global .align 4 ocl_Kernel_FLOPS_param_1
)
{
.reg .f32 %f<1026>; // 1026 float registers !
.reg .pred %p<3>;
.reg .s32 %r<19>;
ld.param.u32 %r1, [ocl_Kernel_FLOPS_param_0];
// some more code unrelated to the problem
// ...
BB1_1:
and.b32 %r13, %r18, 65535;
cvt.rn.f32.u32 %f1, %r13;
fma.rn.f32 %f2, %f1, 0f3F7D70A4, 0f41200000;
fma.rn.f32 %f3, %f2, 0f3F7D70A4, 0f41200000;
fma.rn.f32 %f4, %f3, 0f3F7D70A4, 0f41200000;
fma.rn.f32 %f5, %f4, 0f3F7D70A4, 0f41200000;
// etc
// ...
If I am correct, the PTX uses 1026 float registers to perform the 1024 operations and never reuses a register, even though it could perform all the multiply-add operations using only 2 registers. 1026 is far above the maximum number of registers a thread is allowed to have (according to the specs), so I guess this ends up spilling to memory.
Is it a compiler bug, or am I totally missing something?
I am using nvcc version 6.5 on a Quadro K2000 GPU.
EDIT
Actually I did miss something in the specs:
"Since PTX supports virtual registers, it is quite common for a compiler frontend to generate
a large number of register names. Rather than require explicit declaration of every name,
PTX supports a syntax for creating a set of variables having a common prefix string
appended with integer suffixes. For example, suppose a program uses a large number, say
one hundred, of .b32 variables, named %r0, %r1, ..., %r99"
The PTX file format is intended to describe a virtual machine and instruction set architecture:
PTX defines a virtual machine and ISA for general purpose parallel thread execution. PTX programs are translated at install time to the target hardware instruction set. The PTX-to-GPU translator and driver enable NVIDIA GPUs to be used as programmable parallel computers.
So the PTX output that you are obtaining there is not a form of "GPU assembler". It is only an intermediate representation, intended to be capable of describing virtually any form of parallel computation.
The PTX representation is then compiled into actual binaries for the respective target GPU. This is important in order to make it possible to abstract from the actual hardware architecture; specifically, regarding your example, it should be possible to use the same PTX representation of a program regardless of the number of registers that are available on a specific target machine. The 1026 "registers" that you see there are virtual registers, and in the end they may be mapped to the (few) real hardware registers that are actually available. You may add the --ptxas-options=-v argument to nvcc during compilation to obtain additional information about the actual register usage.
(This is roughly the same idea as the one behind LLVM: to have a representation that can be optimized and reasoned about, abstracting both from the original source code and from the actual target architecture.)
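Since this is an OpenCL program rather than a CUDA one, the equivalent place to look for the real register usage is the program build log. A minimal host-side sketch; note that -cl-nv-verbose is a vendor-specific build option that NVIDIA's OpenCL driver has historically accepted (an assumption here), and other drivers may reject it:
#include <stdio.h>
#include <stdlib.h>
#ifdef __APPLE__
#include <OpenCL/cl.h>
#else
#include <CL/cl.h>
#endif

/* Build 'program' for 'device' and print the build log.
 * With "-cl-nv-verbose" the NVIDIA log includes ptxas register and
 * spill statistics (vendor-specific option, see note above). */
static void build_and_print_log(cl_program program, cl_device_id device)
{
    clBuildProgram(program, 1, &device, "-cl-nv-verbose", NULL, NULL);

    size_t log_size = 0;
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                          0, NULL, &log_size);

    char *log = (char *)malloc(log_size + 1);
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                          log_size, log, NULL);
    log[log_size] = '\0';
    printf("%s\n", log);
    free(log);
}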

What's the equivalent of rdtsc opcode for PPC?

I have an assembly program that has the following code.
This code compiles fine for an Intel processor, but when I use a PPC (cross-)compiler I get an error that the opcode is not recognized. I am trying to find out whether there is an equivalent opcode for the PPC architecture.
.file "assembly.s"
.text
.globl func64
.type func64,@function
func64:
rdtsc
ret
.size func64,.Lfe1-func64
.globl func
.type func,@function
func:
rdtsc
ret
PowerPC includes a "time base" register which is incremented regularly (although perhaps not at each clock -- it depends on the actual hardware and the operating system). The TB register is a 64-bit value, read as two 32-bit halves with mftb (low half) and mftbu (high half). The four least significant bits of TB are somewhat unreliable (they increment monotonically, but not necessarily with a fixed rate).
Some of the older PowerPC processors do not have the TB register (but the OS might emulate it, probably with questionable accuracy); however, the 603e already has it, so it is a fair bet that most if not all PowerPC systems actually in production have it. There is also an "alternate time base register".
For details, see the Power ISA specification, available from the power.org Web site. At the time of writing that answer, the current version was 2.06B, and the TB register and opcodes were documented at pages 703 to 706.
When you need a 64-bit value on a 32-bit architecture (not sure how it works on 64-bit) and you read the TB register, you can run into the problem of the lower half wrapping from 0xffffffff to 0. Granted, this doesn't happen often, but you can be sure it will happen when it will do the most damage ;)
I recommend you read the upper half first, then the lower, and finally the upper again. Compare the two upper reads: if they are equal, no problem. If they differ (the first should be one less than the last), you have to look at the lower half to see which upper it should be paired with: if its highest bit is set, it pairs with the first, otherwise with the last. A sketch of this follows.
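Here is a minimal C sketch of that read-twice scheme for 32-bit PowerPC, using gcc-style inline assembly; the function name is arbitrary:
/* Read the 64-bit time base on 32-bit PowerPC using the
 * upper / lower / upper scheme described above. */
static unsigned long long read_timebase(void)
{
    unsigned int hi1, lo, hi2;

    __asm__ __volatile__("mftbu %0" : "=r"(hi1)); /* upper half, first read  */
    __asm__ __volatile__("mftb  %0" : "=r"(lo));  /* lower half              */
    __asm__ __volatile__("mftbu %0" : "=r"(hi2)); /* upper half, second read */

    /* If the two upper reads differ, the lower half wrapped in between:
     * a set high bit in 'lo' means it belongs with the first upper read,
     * otherwise with the second. */
    unsigned int hi = (hi1 == hi2) ? hi1
                                   : ((lo & 0x80000000u) ? hi1 : hi2);

    return ((unsigned long long)hi << 32) | lo;
}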
Apple has three versions of mach_absolute_time() for the different types of code:
32-bit
64-bit kernel, 32-bit app
64-bit kernel, 64-bit app
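Whichever variant the OS selects, the caller-side usage is the same. A minimal sketch of timing a code section with mach_absolute_time, converting ticks to nanoseconds with mach_timebase_info:
#include <stdio.h>
#include <stdint.h>
#include <mach/mach_time.h>

int main(void)
{
    mach_timebase_info_data_t tb;
    mach_timebase_info(&tb); // numer/denom convert ticks to nanoseconds

    uint64_t t0 = mach_absolute_time();
    // ... code to be timed ...
    uint64_t t1 = mach_absolute_time();

    uint64_t ns = (t1 - t0) * tb.numer / tb.denom;
    printf("elapsed: %llu ns\n", (unsigned long long)ns);
    return 0;
}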
Inspired by a comment from Peter Cordes and the disassembly of clang's __builtin_readcyclecounter:
mfspr 3, 268
blr
For gcc you can do the following:
unsigned long long rdtsc(){
    unsigned long long rval;
    // SPR 268 is the time base; use %0 so the compiler picks the output register
    __asm__ __volatile__("mfspr %0, 268" : "=r" (rval));
    return rval;
}
Or For clang:
unsigned long long readTSC() {
    // _mm_lfence(); // optionally wait for earlier insns to retire before reading the clock
    return __builtin_readcyclecounter();
}
