What memory addresses can never point to a valid object - pointers

I would like to have a set of dummy addresses to use as flag values that can never be valid pointers.
For example, if I knew that pointers 0xffff0000 through 0xffffffff were always invalid, I could do something like this in C:
#include <stdint.h>
#include <stdlib.h>

enum {
    SIZE_TOO_SMALL = 0xffff0001,
    SIZE_TOO_LARGE = 0xffff0002,
    SIZE_EVEN      = 0xffff0003,
};

char* allocate_odd_arry(int size) {
    if (size % 2 == 0)
        return (char *)(uintptr_t)SIZE_EVEN;
    if (size < 100)
        return (char *)(uintptr_t)SIZE_TOO_SMALL;
    if (size > 1000)
        return (char *)(uintptr_t)SIZE_TOO_LARGE;
    return malloc(size);
}
A silly example, but potentially powerful, since it removes the need to pass an extra flag variable. One way I could do this is to allocate a few bytes myself and use those addresses as flags, but that comes with a small memory cost for each unique flag I use.
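For reference, here is roughly what that self-allocated-flag idea could look like: the addresses of a few static bytes cannot collide with anything malloc returns, and each flag costs a single byte (just a sketch, the names are made up):

#include <stdlib.h>

/* One static byte per flag; their addresses are distinct and malloc can
 * never return them, so they are safe to compare pointers against. */
static const char size_even_flag;
static const char size_too_small_flag;
static const char size_too_large_flag;

#define SIZE_EVEN      ((char *)&size_even_flag)
#define SIZE_TOO_SMALL ((char *)&size_too_small_flag)
#define SIZE_TOO_LARGE ((char *)&size_too_large_flag)

char* allocate_odd_arry(int size) {
    if (size % 2 == 0) return SIZE_EVEN;
    if (size < 100)    return SIZE_TOO_SMALL;
    if (size > 1000)   return SIZE_TOO_LARGE;
    return malloc(size);
}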
I don't expect a portable solution, but is there any guarantee on Windows, Linux, or macOS that the addressable space will not include certain values?

For Windows I have found this article, which says that on 32-bit systems the virtual address space is 0x00000000 to 0x7fffffff, and on 64-bit systems it is 0x0000000000000000 to 0x00007fffffffffff. I am not sure whether other addresses have any reserved meaning, but they ought to be safe for this use case.
Looking at Linux, the answer seems a bit more complicated because (like everything else in Linux) it is configurable. This answer on Unix SE shows how memory is divided between the kernel and user space. 0000_8000_0000_0000 to ffff_7fff_ffff_ffff is listed as non-canonical, which I think means it should never be used. The kernel space (ffff_8000_0000_0000 to ffff_ffff_ffff_ffff) also seems like it ought to be safe, but I'm less sure that no system function could ever return such a pointer.
On Mac OS I've found this article, which puts the virtual memory range at 0 to 0x0007_FFFF_FFFF_F000 (64-bit) or 0 to 0xFFFF_F000 (32-bit), so anything outside these ranges should be fine.
It seems there is a little bit of overlap between the unused regions, so if you wanted to target all three platforms with the same addresses it would be possible. I'm still not 100% confident that these addresses are truly safe to use on the respective OSes, so I'm still holding out for someone more knowledgeable to chime in.
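To make the idea concrete, here is what such sentinels could look like on a 64-bit target. The specific values are my own guess based on the ranges above, not something any of the platforms documents as a guarantee:

#include <stdint.h>

/* Assumption: these values sit in the x86-64 non-canonical range
 * (0x0000_8000_0000_0000 through 0xffff_7fff_ffff_ffff), which also lies
 * outside the user-space ranges quoted above for Windows, Linux, and macOS. */
#define SIZE_TOO_SMALL ((char *)(uintptr_t)UINT64_C(0xdead000000000001))
#define SIZE_TOO_LARGE ((char *)(uintptr_t)UINT64_C(0xdead000000000002))
#define SIZE_EVEN      ((char *)(uintptr_t)UINT64_C(0xdead000000000003))

/* Only ever compare against these, e.g. if (p == SIZE_EVEN) { ... };
 * never dereference them, and note that even forming such a pointer is
 * implementation-defined behavior in C. */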

Related

Operate only on a subset of buffer in OpenCL kernel

Newbie to OpenCL here. I'm trying to convert a numerical method I've written to OpenCL for acceleration. I'm using the PyOpenCL package as I've written this once in Python already and as far as I can tell there's no compelling reason to use the C version. I'm all ears if I'm wrong on this, though.
I've managed to translate most of the functionality I need into OpenCL kernels. My question is how to (properly) tell OpenCL to ignore my boundary/ghost cells. The reason I need to do this is that my method, for point i, accesses cells at [i-2:i+2], so if i=1 I'll run off the end of the array. So I add some extra points to prevent this, and then tell my algorithm to only run on points [2:nPts-2]. It's easy to see how to do this with a for loop, but I'm less clear on the 'right' way to do this for a kernel.
Is it sufficient to do, for example (pseudocode)
__kernel void myMethod(...) {
    int gid = get_global_id(0);
    if (gid < nGhostCells || gid >= nPts - nGhostCells) {
        retVal[gid] = 0;
        return;
    }
    // Otherwise perform my calculations
}
or is there another/more appropriate way to enforce this constraint?
It looks sufficient.
The branch takes the same path for the nPts - nGhostCells*2 interior points, and it is predictable if nPts and nGhostCells are compile-time constants. Even if it is not predictable, a sufficiently large nPts relative to nGhostCells (say 1024 vs 3) should not be noticeably slower than a branch-free version, apart from the latency of the "or" operation. Even that "or" latency should be hidden behind the array-access latency, thanks to thread-level parallelism.
At the boundary points, only a few groups of 16 or 32 threads lose some performance, and only for a few clock cycles, because of the lock-step execution of SIMD-like architectures.
If you happen to have chaotic branching, such as a data-driven code path, then you should split the work into different kernels (for different regions) or sort the data before the kernel so that the average divergence between neighboring threads is minimized.
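As an aside, if you would rather the ghost cells never be touched at all, you can launch only the interior range with a global work offset (OpenCL 1.1+; PyOpenCL also supports a global offset). A host-side sketch in the C API, assuming queue, kernel, nPts and nGhostCells already exist:

// Run the kernel only for gid in [nGhostCells, nPts - nGhostCells), so the
// in-kernel guard becomes unnecessary for the ghost cells.
size_t offset = (size_t)nGhostCells;
size_t global = (size_t)(nPts - 2 * nGhostCells);
clEnqueueNDRangeKernel(queue, kernel, 1, &offset, &global,
                       NULL, 0, NULL, NULL);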

OpenCL - Understanding the output of CL_KERNEL_WORK_GROUP_SIZE and CL_KERNEL_PRIVATE_MEM_SIZE

According to the clGetKernelWorkGroupInfo documentation (from here), I tried to query the work-group size and private memory size used by my kernel. I tested the snippet below on an Android device with an Adreno 530 GPU.
(Code sample from Apple OpenCL tutorial)
size_t maxWorkGroupSize;
cl_ulong private_mem_used;
clGetKernelWorkGroupInfo(kernel, &device, CL_KERNEL_WORK_GROUP_SIZE, sizeof(maxWorkGroupSize), &maxWorkGroupSize, NULL );
clGetKernelWorkGroupInfo(kernel, &device, CL_KERNEL_PRIVATE_MEM_SIZE, sizeof(private_mem_used), &private_mem_used, NULL );
printf("Max work-group size is %ld \n", maxWorkGroupSize);
printf("Private memory used is %lld KB\n", private_mem_used/1024);
Output:
Max work-group size is 42773336
Private memory used is 179412930700111 KB
The output does not seem to be correct.
If the output is not correct, is there anything wrong with the snippet?
If the output is correct, it would be helpful if you could help me interpret it.
Your problem with the wrong values seems to have been resolved in the comment by user #pmdj.
I'm referring here to why you may seemingly always get the value 0 returned for the parameter name CL_KERNEL_PRIVATE_MEM_SIZE.
The thing is, the values returned for CL_KERNEL_PRIVATE_MEM_SIZE vary between platforms.
On some platforms it returns the amount of private memory used, i.e. the number of bytes needed to store all variables in registers. Note that the compiler performs optimizations, so this does not have to equal the sum of your variables' sizes.
On other platforms it returns the amount of private memory spilled. So if you use too many variables and exceed the size of your register file, the compiler has to spill into caches or global memory. You can monitor this value as you make changes to your kernel: if it starts spilling, your kernel will likely become slower, often much slower. The amount spilled may be reported in bytes or in number of registers, which in turn may mean the number of 32-bit words.
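For completeness, here is a sketch of the corrected query. The C prototype of clGetKernelWorkGroupInfo takes the cl_device_id by value, so passing &device (as in the snippet above) is presumably what produced the garbage numbers; the printf format specifiers are also adjusted for size_t and cl_ulong:

size_t maxWorkGroupSize;
cl_ulong private_mem_used;
// Pass the device ID itself, not a pointer to it.
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(maxWorkGroupSize), &maxWorkGroupSize, NULL);
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PRIVATE_MEM_SIZE,
                         sizeof(private_mem_used), &private_mem_used, NULL);
printf("Max work-group size is %zu\n", maxWorkGroupSize);
printf("Private memory used is %llu KB\n",
       (unsigned long long)(private_mem_used / 1024));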

Does OpenCL always zero-initialize device memory?

I've noticed that often, global and constant device memory is initialized to 0. Is this a universal rule? I wasn't able to find anything in the standard.
No it doesn't. For instance I had this small kernel to test atomic add:
kernel void atomicAdd(volatile global int *result){
atomic_add(&result[0], 1);
}
Calling it with this host code (pyopencl + unittest):
def test_atomic_add(self):
    NDRange = (4, 4)
    result = np.zeros(1, dtype=np.int32)
    out_buf = cl.Buffer(self.ctx, self.mf.WRITE_ONLY, size=result.nbytes)
    self.prog.atomicAdd(self.queue, NDRange, NDRange, out_buf)
    cl.enqueue_copy(self.queue, result, out_buf).wait()
    self.assertEqual(result, 16)
was always returning the correct value when using my CPU. However, on an ATI HD 5450 the returned value was always junk.
And if I recall correctly, on an NVIDIA GPU the first run returned the correct value, i.e. 16, but on the following runs the values were 32, 48, etc. It was reusing the same memory location with the old value still stored there.
When I corrected my host code with this line (copying the 0 value to the buffer):
out_buf = cl.Buffer(self.ctx, self.mf.WRITE_ONLY | self.mf.COPY_HOST_PTR, hostbuf=result)
everything worked fine on all devices.
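As an alternative to copying a zeroed host buffer, you can also zero the device buffer explicitly. A sketch using the C host API (OpenCL 1.2+, error checking omitted; queue, out_buf and buf_size stand for your command queue, cl_mem object and buffer size in bytes):

// Explicitly fill the buffer with zeros instead of relying on the driver.
// Note: offset and size must be multiples of the pattern size.
cl_int zero = 0;
clEnqueueFillBuffer(queue, out_buf, &zero, sizeof(zero),
                    0, buf_size, 0, NULL, NULL);
clFinish(queue);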
As far as I know, there is no sentence in the standard that states this.
Maybe some driver implementations will do this automatically, but you shouldn't rely on it.
I remember that once I had a case where a buffer was not initialized to 0, but I can't remember the "OS + driver" combination.
Probably what is going on is that a typical OS does not use even 1% of the memory on today's devices, so when you start an OpenCL application there is a high probability that your allocations land in a zone that has never been touched and therefore still reads as zero.
It depends on the platform you are developing for. As #DarkZeros mentioned in the previous answers, the spec does not guarantee anything. Please see page 104 of the OpenCL 2.1 spec.
However, based on our experience with Mali GPUs, the driver initializes all elements of a newly allocated buffer to zero. This only holds for the first touch: once the buffer is released and its memory space is reused by a new buffer, that memory is not re-initialized to zero. In other words, the first touch sees zero values; after that, you see the usual gibberish values.
Hope this helps after such a long time!

Using strconv for negative hex values in Go

I've been building an assembler for no good reason over the past day or so, using Go so I can get familiar with the language. It's my first real program in Go, so I expected problems, but one bug keeps coming up time and time again. In other cases I found hacky ways around it, but this time I think I need a real answer so I feel like I'm actually doing this right.
Basically, I have to parse tons of byte values. Some of these are signed bytes, so -1 = 0xFF and so on. When calculating the address of a label, I need to find its offset from the current address. The following code is a stripped-down version of what I use to get the offset:
// lbladdr holds the target label address
// address holds current address in memory
// label[x] holds the offset
if address > lbladdr {
    lbladdr -= address
}
label[x] = strconv.FormatInt(int64(lbladdr), 16)
This works for positive values, but when I get a negative offset (address > lbladdr), instead of getting a value like FE I get -2. I don't get why the standard library would prepend a negative sign to a hex number, and I haven't been able to find anything in the documentation about it. I've looked in a lot of other places, but I can't seem to find anyone with the same problem either.
I hope it's just something on my end that is a simple fix.
It's perfectly reasonable to use a negative sign on hexadecimal numbers. I know that when working with assembly it's common to use the actual bit pattern of a register, written in hex, to represent signed values. However, Go doesn't know you are doing that, and Go's formatting functions are not written to print hex values as they would appear in a CPU register. Furthermore, the bit pattern will differ depending on the register size (16 vs 32 vs 64 bits, and big vs little endian) you would be storing them in, so the base alone isn't enough to print them the way you want. You would need to write your own formatting code that supports formatting for the type of register you want to represent.
It's by design: http://golang.org/src/pkg/strconv/itoa.go?s=628:668#L8
What you may want is to cast to uint64:
package main

import (
    "fmt"
    "strconv"
)

func main() {
    i := -1
    fmt.Printf("%x\n", uint64(i))
    fmt.Println(strconv.FormatUint(uint64(i), 16))
}

How do I find the base address and size of the stack on MacOS X?

I'm porting an imprecise garbage collector from Windows to MacOS X. In it, it has to scan the stack to identify potential pointers into the heap, and then use those as GC roots. To do this, I need the stack's base address as well as its length. On Windows, this code uses an algorithm similar to what's described here:
Stack and Stack Base Address
How do I do that on Mac OS X? Note that, for now, I only care about the main thread. The interpreter that uses this GC is single-threaded, and I can guarantee that no references exist on other threads.
You could also get the stack's base address and size with these Darwin-specific functions:
pthread_t self = pthread_self();
void* addr = pthread_get_stackaddr_np(self);
size_t size = pthread_get_stacksize_np(self);
printf("addr=%p size=%zx\n", addr, size);
Hans Boehm's conservative GC for C runs on MacOS X, and is open-source. So you could conceivably have a look at the source code of that GC to see how it locates the stack.
Alternatively, depending on how much you control the calling code, you may simply take the address of a local variable somewhere "high" (e.g. in the main() function or its MacOS X equivalent, or in the starting function for the relevant thread). Possibly, you might be able to simply choose the stack address and size upon thread creation (with Posix threads, this is done with pthread_attr_setstack() -- Posix threads can be used with MacOS X).
