OpenCL double precision not working - opencl

So I have a project that I created on a Mac Pro using the double data type, and it works perfectly. Now I moved my project to a MacBook Air and it started giving me
Exception
ERROR: clBuildProgram(-11)
error. Now the reason for this is that my MacBook Air does not support the double precision type with OpenCL. But on the contrary, I found this: OpenCL kernel error on Mac OSx. I applied the method in the answer like this:
cl_device_fp_config cfg;
clGetDeviceInfo(devicesIds[0], CL_DEVICE_DOUBLE_FP_CONFIG, sizeof(cfg), &cfg, NULL); // 0 is for the device number I guess?
printf("Double FP config = %llu\n", cfg);
And he explained that if the result is 0, then it means that double precision is not supported. But I get this result:
Double FP config = 63
I also tried this method:
if (!cfg) {
printf("Double precision not supported \n\n");
} else {
printf("Following double precision features supported:\n");
if(cfg & CL_FP_INF_NAN)
printf(" INF and NaN values\n");
if(cfg & CL_FP_DENORM)
printf(" Denormalized numbers\n");
if(cfg & CL_FP_ROUND_TO_NEAREST)
printf(" Round To Nearest Even mode\n");
if(cfg & CL_FP_ROUND_TO_INF)
printf(" Round To Infinity mode\n");
if(cfg & CL_FP_ROUND_TO_ZERO)
printf(" Round To Zero mode\n");
if(cfg & CL_FP_FMA)
printf(" Floating-point multiply-and-add operation\n\n");
}
And I got the following results:
Double FP config = 63
Following double precision features supported:
INF and NaN values
Denormalized numbers
Round To Nearest Even mode
Round To Infinity mode
Round To Zero mode
Floating-point multiply-and-add operation
What is going on here? Does my system support double precision with OpenCL or not? If yes, how do I enable and use it? If not, what are my alternatives?
Now I am very confused. First of all, I don't know if my MacBook Air supports double precision or not? Apparently it doesn't. But with the output, it seems like it does.
If it doesn't support double precision, then what should I do? Should I change everything in my project to float values? Or, if it does, how do I enable it? I followed a lot of tutorials and examples, e.g. https://streamcomputing.eu/blog/2013-10-17/writing-opencl-code-single-double-precision/, but none of them seem to work.

Unfortunately double support is "optional" in OpenCL, some devices support it and some don't...
If a device supports double, then the device extensions string (CL_DEVICE_EXTENSIONS) will contain cl_khr_fp64 or cl_amd_fp64.
There is some example kernel code here to use floats when doubles aren't available.
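The extension check described above can be sketched on the host side as follows. This is a minimal sketch, not the answer's own code: it assumes the extension string has already been fetched with clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, ...), and the helper names has_extension and supports_double are hypothetical. It matches whole space-separated tokens so that a name such as cl_APPLE_fp64_basic_ops is not mistaken for full fp64 support.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Returns true if `name` appears as a whole space-separated token in
 * `extensions` (the format of the CL_DEVICE_EXTENSIONS string). */
bool has_extension(const char *extensions, const char *name)
{
    assert(extensions != NULL && name != NULL);
    size_t len = strlen(name);
    if (len == 0)
        return false;

    const char *p = extensions;
    while ((p = strstr(p, name)) != NULL) {
        bool starts = (p == extensions) || (p[-1] == ' ');
        bool ends = (p[len] == '\0') || (p[len] == ' ');
        if (starts && ends)
            return true;
        p += len; /* partial match only, keep searching */
    }
    return false;
}

/* Hypothetical wrapper: true if either the Khronos or the older AMD
 * double-precision extension is listed. */
bool supports_double(const char *extensions)
{
    return has_extension(extensions, "cl_khr_fp64") ||
           has_extension(extensions, "cl_amd_fp64");
}
```

If supports_double returns false, the kernel source would have to fall back to float, for example by passing a -D define in the clBuildProgram options and using a typedef inside the kernel.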

Related

How do you use double in OpenCL on a MacPro?

I have a Mac Pro (Late 2013) and I want to do some math in double using OpenCL. When I was using Mavericks the CL_DEVICE_EXTENSIONS for my FirePro GPU only listed cl_APPLE_fp64_basic_ops so I couldn't use double math functions like exp(). I recently upgraded to Yosemite and now the proper cl_khr_fp64 is in the list of extensions but I still can't use exp for double. The error log shows that it's looking for an overloaded function and exp is available for float, float4, float8,... but not 64 bit. I have included the command to turn on fp64:
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
Does anyone know what's going on? Why does the GPU say that cl_khr_fp64 is available but then I can't use all of it. I can +-*/ in double, but I could also do that before with just basic_ops. Is Apple lying to me that they upgraded support of fp64?
Strangely, on my CPU OpenCL also says that cl_khr_fp64 is also available but I can't use exp on the CPU either.
In OpenCL C, you should call them doubles not cl_khr_fp64s
e.g.
double pie = M_PI;
double2 two_pies = (double2)(M_PI); // or (double2)(M_PI, M_PI)

How to specify the multicycle constraint for all paths using certain clock enable (in Vivado)?

I'm designing a huge system in an FPGA, operating at a system clock of 320 MHz.
Certain operations must be performed at a slower clock (160 MHz) due to long critical paths.
I can introduce a clock enable signal, let's call it CE2, used by the registers surrounding such long operations.
According to the old Xilinx documentation: http://www.xilinx.com/itp/xilinx10/books/docs/timing_constraints_ug/timing_constraints_ug.pdf (page 60), I can add a special constraint:
NET CE2 TNM = slow_exception;
NET clk TNM = normal;
TIMESPEC TS01 = PERIOD normal 8 ns;
TIMESPEC TS02 = FROM slow_exception TO slow_exception TS01*2;
defining such multicycle timing constraint.
Unfortunately, the above case is not described in newer versions of the documentation,
and especially not in the documentation for the Vivado tools.
Does anybody know how the above problem should be solved in XDC file for Vivado?
The new way of doing multicycle constraints in Vivado specifies the number of cycles rather than the direct period.
You can also use datapath_only constraints for false paths and clock crossings, which are more directly akin to what you used in ISE
This is a datapath_only constraint:
create_clock -period 8.000 -name arbitraryClkName -waveform {0.000 4.000} [get_ports portName];
set_max_delay -from [get_pins {hierarchical_location_source/CLK}] -to [get_clocks arbitraryClkName] -datapath_only 16.000;
Here is an actual multicycle hold command:
set_multicycle_path -hold 2 -from [get_pins {hierarchical_location_source/CLK}] -to [get_pins {hierarchical_location_sink/D}];
Here is the constraint documentation for Vivado 2014.3; you can find multicycle path documentation on page 79:
http://www.xilinx.com/support/documentation/sw_manuals/xilinx2014_3/ug903-vivado-using-constraints.pdf

Cannot find overload for '/', '*', Failure when running an app on anything but iPhone 5s simulator

My game builds and runs successfully on the iPhone 5s simulator, but when I try on any other version I get the following two errors:
`Could not find an overload for '*' that accepts the supplied arguments`
`Could not find an overload for '/' that accepts the supplied arguments`
I'm writing my game entirely in Swift, and the deployment target is iOS 7.1
The die rolls in the picture are defined as
let lengthDiceroll = Double(arc4random()) / 0x100000000
let sideDiceroll = Int(arc4random_uniform(UInt32(4)))
Your problem is a difference between 32- and 64-bit architecture. Note that the target architecture you're compiling your target for is determined by the selected device—if you've got the iPhone 4S simulator selected as your target in Xcode, for example, you'll be building for 32 bit; if you've got the iPhone 5S simulator selected, you'll be building for 64-bit.
You haven't included enough code to help us figure out what exactly is going on (we'd need to know the types of the variable you're assigning to) but here's my theory. In your first error, sprite.speed is probably a CGFloat. CGFloat is 32-bit ("float") on 32-bit targets, 64-bit ("double") on 64-bit targets. So this, for example:
var x:CGFloat = Double(arc4random()) / 0x100000000
...will compile fine on a 64-bit target, because you're putting a double into a double. But when compiling for a 32-bit target, it'll give you the error that you're getting, because you're losing precision by trying to stuff a double into a float.
This will work on both:
var x:CGFloat = CGFloat(arc4random()) / 0x100000000
Your other errors are caused by the same issue (though again, I can't reproduce them accurately without knowing what type you've declared width and height as.) For example, this will fail to compile for a 32-bit architecture:
let lengthDiceroll = Double(arc4random()) / 0x100000000
let width:CGFloat = 5
var y:CGPoint = CGPointMake(width * lengthDiceroll, 0)
...because lengthDiceroll is a Double, so width * lengthDiceroll is a Double. CGPointMake takes CGFloat arguments, so you're trying to stuff a Double (64-bit) into a float (32-bit.)
This will compile on both architectures:
let lengthDiceroll = Double(arc4random()) / 0x100000000
let width:CGFloat = 5
var y:CGPoint = CGPointMake(width * CGFloat(lengthDiceroll), 0)
...or possibly better, declare lengthDiceroll as CGFloat in the first place. It won't be as accurate on 32-bit architectures, but that's sort of the point of CGFloat:
let lengthDiceroll = CGFloat(arc4random()) / 0x100000000
let width:CGFloat = 5
var y:CGPoint = CGPointMake(width * lengthDiceroll, 0)
I've experienced similar errors where debug builds work and release builds fail for example. My advice would be make all your types explicit:
let lengthDiceroll = Double(arc4random()) / Double(0x100000000)
I've also had similar problems with CGFloat and CGPoint, make sure you explicitly use CGFloat, e.g. CGFloat(2.0)

Using the extra 16 bits in 64-bit pointers

I read that a 64-bit machine actually uses only 48 bits of address (specifically, I'm using Intel core i7).
I would expect that the extra 16 bits (bits 48-63) are irrelevant for the address, and would be ignored. But when I try to access such an address I got a signal EXC_BAD_ACCESS.
My code is:
int *p1 = &val;
int *p2 = (int *)((long)p1 | 1ll<<48);//set bit 48, which should be irrelevant
int v = *p2; //Here I receive a signal EXC_BAD_ACCESS.
Why this is so? Is there a way to use these 16 bits?
This could be used to build more cache-friendly linked list. Instead of using 8 bytes for next ptr, and 8 bytes for key (due to alignment restriction), the key could be embedded into the pointer.
The high order bits are reserved in case the address bus would be increased in the future, so you can't use it simply like that
The AMD64 architecture defines a 64-bit virtual address format, of which the low-order 48 bits are used in current implementations (...) The architecture definition allows this limit to be raised in future implementations to the full 64 bits, extending the virtual address space to 16 EB (2^64 bytes). This is compared to just 4 GB (2^32 bytes) for the x86.
http://en.wikipedia.org/wiki/X86-64#Architectural_features
More importantly, according to the same article [Emphasis mine]:
... in the first implementations of the architecture, only the least significant 48 bits of a virtual address would actually be used in address translation (page table lookup). Further, bits 48 through 63 of any virtual address must be copies of bit 47 (in a manner akin to sign extension), or the processor will raise an exception. Addresses complying with this rule are referred to as "canonical form."
As the CPU will check the high bits even if they're unused, they're not really "irrelevant". You need to make sure that the address is canonical before using the pointer. Some other 64-bit architectures like ARM64 have the option to ignore the high bits, therefore you can store data in pointers much more easily.
That said, in x86_64 you're still free to use the high 16 bits if needed (if the virtual address is not wider than 48 bits, see below), but you have to check and fix the pointer value by sign-extending it before dereferencing.
Note that casting the pointer value to long is not the correct way to do because long is not guaranteed to be wide enough to store pointers. You need to use uintptr_t or intptr_t.
int *p1 = &val; // original pointer
uint8_t data = ...;
const uintptr_t MASK = ~(0xFFFFULL << 48); // clears the 16 high bits (48-63)
// === Store data into the pointer ===
// Note: To be on the safe side and future-proof (because future implementations
// can increase the number of significant bits in the pointer), we should
// store values from the most significant bits down to the lower ones
// Cast data to uintptr_t first: shifting a promoted int by 56 is undefined
int *p2 = (int *)(((uintptr_t)p1 & MASK) | ((uintptr_t)data << 56));
// === Get the data stored in the pointer ===
data = (uintptr_t)p2 >> 56;
// === Dereference the pointer ===
// Sign extend first to make the pointer canonical
// Note: Technically this is implementation defined. You may want a more
// standard-compliant way to sign-extend the value
intptr_t p3 = ((intptr_t)p2 << 16) >> 16;
val = *(int*)p3;
WebKit's JavaScriptCore and Mozilla's SpiderMonkey engine, as well as LuaJIT, use this in the NaN-boxing technique. If the value is NaN, the low 48 bits store the pointer to the object, with the high 16 bits serving as tag bits; otherwise it's a double value.
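To make the NaN-boxing idea concrete, here is a minimal sketch in C. It is not the encoding used by any of the engines named above: the tag value BOX_PTR_TAG and all function names are hypothetical, and the sketch assumes 64-bit IEEE 754 doubles, user-space pointers that fit in 47 bits, and that no ordinary arithmetic result carries the chosen tag in its high 16 bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t boxed_t;

/* Hypothetical tag: sign bit, all-ones exponent and the two top mantissa
 * bits set, so the value is a NaN when reinterpreted as a double. */
#define BOX_PTR_TAG  0xFFFC000000000000ULL
#define BOX_PTR_MASK 0x0000FFFFFFFFFFFFULL

/* Store a double by copying its bit pattern. */
boxed_t box_double(double d)
{
    boxed_t b;
    memcpy(&b, &d, sizeof b);
    return b;
}

double unbox_double(boxed_t b)
{
    double d;
    memcpy(&d, &b, sizeof d);
    return d;
}

/* Store a pointer in the low 48 bits under the NaN tag. */
boxed_t box_pointer(const void *p)
{
    uint64_t bits = (uint64_t)(uintptr_t)p;
    assert((bits >> 47) == 0); /* assumption: 47-bit user-space address */
    return BOX_PTR_TAG | bits;
}

bool is_pointer(boxed_t b)
{
    return (b & ~BOX_PTR_MASK) == BOX_PTR_TAG;
}

void *unbox_pointer(boxed_t b)
{
    assert(is_pointer(b));
    return (void *)(uintptr_t)(b & BOX_PTR_MASK);
}
```

Because bit 47 is assumed to be zero, masking off the tag already yields a canonical address, so no sign extension is needed before dereferencing.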
Previously, Linux also used the 63rd bit of the GS base address to indicate whether the value was written by the kernel.
In reality you can usually use bit 47 too, because most modern 64-bit OSes split kernel and user space in half, so bit 47 is always zero for user-space pointers and you have 17 top bits free for use.
You can also use the lower bits to store data. It's called a tagged pointer. If int is 4-byte aligned then the 2 low bits are always 0 and you can use them like in 32-bit architectures. For 64-bit values you can use the 3 low bits because they're already 8-byte aligned. Again you also need to clear those bits before dereferencing.
int *p1 = &val; // the pointer we want to store the value into
int tag = 1;
const uintptr_t MASK = ~0x03ULL;
// === Store the tag ===
int *p2 = (int *)(((uintptr_t)p1 & MASK) | tag);
// === Get the tag ===
tag = (uintptr_t)p2 & 0x03;
// === Get the referenced data ===
// Clear the 2 tag bits before using the pointer
intptr_t p3 = (uintptr_t)p2 & MASK;
val = *(int*)p3;
One famous user of this is the V8 engine with SMI (small integer) optimization. The lowest bit in the address will serve as a tag for type:
if it's 1, the value is a pointer to the real data (objects, floats or bigger integers). The next higher bit (w) indicates whether the pointer is weak or strong. Just clear the tag bits and dereference it
if it's 0, it's a small integer. In 32-bit V8 or 64-bit V8 with pointer compression it's a 31-bit int, do a signed right shift by 1 to restore the value; in 64-bit V8 without pointer compression it's a 32-bit int in the upper half
32-bit V8
         |----- 32 bits -----|
Pointer: |_____address_____w1|
Smi:     |___int31_value____0|
64-bit V8
         |----- 32 bits -----|----- 32 bits -----|
Pointer: |________________address______________w1|
Smi:     |____int32_value____|0000000000000000000|
https://v8.dev/blog/pointer-compression
As commented below, Intel has published 5-level paging (PML5), which provides a 57-bit virtual address space. If you're on such a system, you can use only the 7 high bits.
You can still use some workarounds to get more free bits, though. First, you can try to use 32-bit pointers in a 64-bit OS. In Linux, if the x32 ABI is allowed, pointers are only 32 bits long. In Windows, just clear the /LARGEADDRESSAWARE flag and pointers have only 32 significant bits, so you can use the upper 32 bits for your purpose. See How to detect X32 on Windows?. Another way is to use some pointer compression tricks: How does the compressed pointer implementation in V8 differ from JVM's compressed Oops?
You can further get more bits by requesting the OS to allocate memory only in the low region. For example if you can ensure that your application never uses more than 64MB of memory then you need only a 26-bit address. And if all the allocations are 32-byte aligned then you have 5 more bits to use, which means you can store 64 - 21 = 43 bits of information in the pointer!
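The bit budget in the paragraph above can be checked with a few lines of C. This is only an illustrative sketch: the function names are hypothetical, and it assumes that both the size of the low region and the allocation alignment are powers of two.

```c
#include <assert.h>
#include <stdint.h>

/* log2 of a power of two, i.e. the number of bits needed to index it. */
unsigned log2_pow2(uint64_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    unsigned bits = 0;
    while (size > 1) {
        size >>= 1;
        bits++;
    }
    return bits;
}

/* Bits of a 64-bit pointer left over for payload when every allocation
 * lies below `region_size` bytes and is aligned to `alignment` bytes. */
unsigned spare_pointer_bits(uint64_t region_size, uint64_t alignment)
{
    unsigned address_bits = log2_pow2(region_size); /* 64 MB -> 26 */
    unsigned align_bits = log2_pow2(alignment);     /* 32 B  -> 5  */
    assert(align_bits <= address_bits);
    return 64 - (address_bits - align_bits);
}
```

For the numbers in the text, spare_pointer_bits(64ULL << 20, 32) gives 64 - (26 - 5) = 43.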
I guess ZGC is one example of this. It uses only 42 bits for addressing, which allows for 2^42 bytes = 4 × 2^40 bytes = 4 TB
ZGC therefore just reserves 16TB of address space (but not actually uses all of this memory) starting at address 4TB.
A first look into ZGC
It uses the bits in the pointer like this:
 6                 4 4 4  4 4                                             0
 3                 7 6 5  2 1                                             0
+-------------------+-+----+-----------------------------------------------+
|00000000 00000000 0|0|1111|11 11111111 11111111 11111111 11111111 11111111|
+-------------------+-+----+-----------------------------------------------+
|                   | |    |
|                   | |    * 41-0 Object Offset (42-bits, 4TB address space)
|                   | |
|                   | * 45-42 Metadata Bits (4-bits)  0001 = Marked0
|                   |                                 0010 = Marked1
|                   |                                 0100 = Remapped
|                   |                                 1000 = Finalizable
|                   |
|                   * 46-46 Unused (1-bit, always zero)
|
* 63-47 Fixed (17-bits, always zero)
For more information on how to do that see
Allocating Memory Within A 2GB Range
How can I ensure that the virtual memory address allocated by VirtualAlloc is between 2-4GB
Allocate at low memory address
How to malloc in address range > 4 GiB
Custom heap/memory allocation ranges
Side note: Using a linked list when the keys are tiny compared to the pointers is a huge waste of memory, and it's also slower due to bad cache locality. In fact, you shouldn't use linked lists in most real-life problems:
Bjarne Stroustrup says we must avoid linked lists
Why you should never, ever, EVER use linked-list in your code again
Number crunching: Why you should never, ever, EVER use linked-list in your code again
Bjarne Stroustrup: Why you should avoid Linked Lists
Are lists evil?—Bjarne Stroustrup
A standards-compliant way to canonicalize AMD/Intel x64 pointers (based on the current documentation of canonical pointers and 48-bit addressing) is
int *p2 = (int *)(((uintptr_t)p1 & ((1ull << 48) - 1)) |
~(((uintptr_t)p1 & (1ull << 47)) - 1));
This first clears the upper 16 bits of the pointer. Then, if bit 47 is 1, this sets bits 47 through 63, but if bit 47 is 0, this does a logical OR with the value 0 (no change).
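The same expression can be wrapped in a small function and exercised on plain 64-bit integers. This is a sketch under the 48-bit addressing assumption stated above: the names canonicalize48 and tag_high16 are hypothetical, and it works on uint64_t values rather than real pointers so it can be tested without dereferencing anything.

```c
#include <assert.h>
#include <stdint.h>

/* Overwrite bits 48-63 of an address with a 16-bit tag. */
uint64_t tag_high16(uint64_t addr, uint16_t tag)
{
    return (addr & ((1ULL << 48) - 1)) | ((uint64_t)tag << 48);
}

/* Restore canonical form: copy bit 47 into bits 48-63. */
uint64_t canonicalize48(uint64_t addr)
{
    uint64_t low48 = addr & ((1ULL << 48) - 1);
    /* If bit 47 is set this yields ones in bits 47-63, otherwise zero. */
    uint64_t ext = ~((addr & (1ULL << 47)) - 1);
    uint64_t result = low48 | ext;
    /* Postcondition: bits 47-63 are all zero or all one. */
    assert((result >> 47) == 0 || (result >> 47) == 0x1FFFF);
    return result;
}
```

A lower-half (user-space style) address comes back with zeros in the high bits and an upper-half address with ones, regardless of which tag was stored in between.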
I guess no-one mentioned possible use of bit fields ( https://en.cppreference.com/w/cpp/language/bit_field ) in this context, e.g.
template<typename T>
struct My64Ptr
{
signed long long ptr : 48; // as per phuclv's comment, we need the type to be signed to be sign extended
unsigned long long ch : 8; // ...and, what's more, as Peter Cordes pointed out, it's better to mark signedness of bit field explicitly (before C++14)
unsigned long long b1 : 1; // Additionally, as Peter found out, types can differ by sign and it doesn't mean the beginning of another bit field (MSVC is particularly strict about it: other type == new bit field)
unsigned long long b2 : 1;
unsigned long long b3 : 1;
unsigned long long still5bitsLeft : 5;
inline My64Ptr(T* ptr) : ptr((long long) ptr)
{
}
inline operator T*()
{
return (T*) ptr;
}
inline T* operator->()
{
return (T*)ptr;
}
};
My64Ptr<const char> ptr ("abcdefg");
ptr.ch = 'Z';
ptr.b1 = true;
ptr.still5bitsLeft = 23;
std::cout << ptr << ", char=" << char(ptr.ch) << ", byte1=" << ptr.b1 <<
", 5bitsLeft=" << ptr.still5bitsLeft << " ...BTW: sizeof(ptr)=" << sizeof(ptr);
// The output is: abcdefg, char=Z, byte1=1, 5bitsLeft=23 ...BTW: sizeof(ptr)=8
// With all signed long long fields, the output would be: abcdefg, char=Z, byte1=-1, 5bitsLeft=-9 ...BTW: sizeof(ptr)=8
I think it may be quite a convenient way to try to make use of these 16 bits, if we really want to save some memory. All the bitwise (& and |) operations and cast to full 64-bit pointer are done by compiler (though, of course, executed in run time).
According to the Intel Manuals (volume 1, section 3.3.7.1), linear addresses have to be in canonical form. This means that indeed only 48 bits are used and the extra 16 bits are sign-extended. Moreover, the implementation is required to check whether an address is in that form and, if it is not, generate an exception. That's why there is no way to use those additional 16 bits.
The reason why it is done in such way is quite simple. Currently 48-bit virtual address space is more than enough (and because of the CPU production cost there is no point in making it larger) but undoubtedly in the future the additional bits will be needed. If applications/kernels were to use them for their own purposes compatibility problems will arise and that's what CPU vendors want to avoid.
Physical memory is 48 bit addressed. That's enough to address a lot of RAM. However between your program running on the CPU core and the RAM is the memory management unit, part of the CPU. Your program is addressing virtual memory, and the MMU is responsible for translating between virtual addresses and physical addresses. The virtual addresses are 64 bit.
The value of a virtual address tells you nothing about the corresponding physical address. Indeed, because of how virtual memory systems work there's no guarantee that the corresponding physical address will be the same moment to moment. And if you get creative with mmap() you can make two or more virtual addresses point at the same physical address (wherever that happens to be). If you then write to any of those virtual addresses you're actually writing to just one physical address (wherever that happens to be). This sort of trick is quite useful in signal processing.
Thus when you tamper with the 48th bit of your pointer (which is pointing at a virtual address) the MMU can't find that new address in the table of memory allocated to your program by the OS (or by yourself using malloc()). It raises an interrupt in protest, the OS catches that and terminates your program with the signal you mention.
If you want to know more I suggest you Google "modern computer architecture" and do some reading about the hardware that underpins your program.

What's the equivalent of rdtsc opcode for PPC?

I have an assembly program that has the following code.
This code compiles fine for an Intel processor. But when I use a PPC (cross) compiler, I get an error that the opcode is not recognized. I am trying to find out whether there is an equivalent opcode for the PPC architecture.
.file "assembly.s"
.text
.globl func64
.type func64,#function
func64:
rdtsc
ret
.size func64,.Lfe1-func64
.globl func
.type func,#function
func:
rdtsc
ret
PowerPC includes a "time base" register which is incremented regularly (although perhaps not at each clock -- it depends on the actual hardware and the operating system). The TB register is a 64-bit value, read as two 32-bit halves with mftb (low half) and mftbu (high half). The four least significant bits of TB are somewhat unreliable (they increment monotonically, but not necessarily with a fixed rate).
Some of the older PowerPC processors do not have the TB register (but the OS might emulate it, probably with questionable accuracy); however, the 603e already has it, so it is a fair bet that most if not all PowerPC systems actually in production have it. There is also an "alternate time base register".
For details, see the Power ISA specification, available from the power.org Web site. At the time of writing that answer, the current version was 2.06B, and the TB register and opcodes were documented at pages 703 to 706.
When you need a 64-bit value on a 32-bit architecture (not sure how it works on 64-bit) and you read the TB register you can run into the problem of the lower half going from 0xffffffff to 0 - granted this doesn't happen often but you can be sure it will happen when it will do the most damage ;)
I recommend you read the upper half first, then the lower and finally the upper again. Compare the two uppers and if they are equal, no problemo. If they differ (the first should be one less than the last) you have to look at the lower to see which upper it should be paired with: if its highest bit is set it should be paired with the first, otherwise with the last.
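The upper/lower/upper read order described above can be sketched in portable C. This is an illustration, not tested on PowerPC hardware: the two reader callbacks are hypothetical stand-ins for wrappers around mftbu and mftb, and the pairing rule is the one given in the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*read32_fn)(void);

/* Pair the lower half with the right upper half after a possible rollover. */
uint64_t combine_time_base(uint32_t upper_first, uint32_t lower,
                           uint32_t upper_last)
{
    uint32_t upper = upper_last;
    /* The uppers differ only if the lower half wrapped between the reads.
     * A lower value with its top bit set was read before the wrap, so it
     * belongs to the first upper; otherwise it belongs to the last one. */
    if (upper_first != upper_last && (lower & 0x80000000u))
        upper = upper_first;
    return ((uint64_t)upper << 32) | lower;
}

/* Read upper, lower, then upper again, as recommended above. */
uint64_t read_time_base(read32_fn read_upper, read32_fn read_lower)
{
    assert(read_upper != NULL && read_lower != NULL);
    uint32_t upper_first = read_upper();
    uint32_t lower = read_lower();
    uint32_t upper_last = read_upper();
    return combine_time_base(upper_first, lower, upper_last);
}
```

Keeping the pairing logic in combine_time_base, separate from the hardware reads, makes the rollover cases easy to check on any machine.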
Apple has three versions of mach_absolute_time() for the different types of code:
32-bit
64-bit kernel, 32-bit app
64-bit kernel, 64-bit app
Inspired by a comment from Peter Cordes and the disassembly of clang's __builtin_readcyclecounter:
mfspr 3, 268
blr
For gcc you can do the following:
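unsigned long long rdtsc(){
unsigned long long rval;
// Use the %0 placeholder so the compiler picks the output register
// (hard-coding r3 leaves rval unset). SPR 268 is the time base; this
// assumes a 64-bit PowerPC target where one register holds the full value.
__asm__ __volatile__("mfspr %0, 268" : "=r" (rval));
return rval;
}
Or for clang: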
unsigned long long readTSC() {
// _mm_lfence(); // optionally wait for earlier insns to retire before reading the clock
return __builtin_readcyclecounter();
}

Resources