I've noticed that often, global and constant device memory is initialized to 0. Is this a universal rule? I wasn't able to find anything in the standard.
No, it isn't. For instance, I had this small kernel to test atomic add:
kernel void atomicAdd(volatile global int *result){
    atomic_add(&result[0], 1);
}
Calling it with this host code (pyopencl + unittest):
def test_atomic_add(self):
    NDRange = (4, 4)
    result = np.zeros(1, dtype=np.int32)
    out_buf = cl.Buffer(self.ctx, self.mf.WRITE_ONLY, size=result.nbytes)
    self.prog.atomicAdd(self.queue, NDRange, NDRange, out_buf)
    cl.enqueue_copy(self.queue, result, out_buf).wait()
    self.assertEqual(result, 16)
was always returning the correct value when run on my CPU. However, on an ATI HD 5450 the returned value was always junk.
And if I recall correctly, on an NVIDIA card the first run returned the correct value, i.e. 16, but the following runs returned 32, 48, etc. The same memory location was being reused with the old value still stored there.
When I corrected my host code with this line (copying the 0 value to the buffer):
out_buf = cl.Buffer(self.ctx, self.mf.WRITE_ONLY | self.mf.COPY_HOST_PTR, hostbuf=result)
Everything worked fine on all devices.
As far as I know, there is no sentence in the standard that states this.
Maybe some driver implementations do it automatically, but you shouldn't rely on it.
I remember that I once had a case where a buffer was not initialized to 0, but I can't remember the OS + driver combination.
Probably what is going on is that a typical OS does not use even 1% of a modern device's memory, so when you start an OpenCL application there is a high probability that your buffer lands in a zone that has never been touched and therefore reads as zero.
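If you actually need guaranteed zeros, clear the buffer explicitly. A minimal sketch using the OpenCL 1.2 host API (queue, buf and N are placeholder names, not from the question):
// Fill an existing buffer of N ints with zeros (requires OpenCL 1.2).
cl_int zero = 0;
cl_int err = clEnqueueFillBuffer(queue, buf, &zero, sizeof(zero),
                                 0, N * sizeof(cl_int), 0, NULL, NULL);
// On OpenCL 1.1 and earlier, write a zeroed host array with
// clEnqueueWriteBuffer, or create the buffer with CL_MEM_COPY_HOST_PTR
// as in the pyopencl fix above.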
It depends on the platform you are developing for. As @DarkZeros mentioned in the previous answer, the spec does not guarantee anything; see page 104 of the OpenCL 2.1 spec.
However, based on our experience with Mali GPUs, the driver initializes all elements of a newly allocated buffer to zero on first touch. Later on, when that buffer is released and its memory space is occupied by a new buffer, that memory space is not initialized to zero again. In other words, only the first touch sees zero values; after that you see whatever gibberish was left behind.
Hope this helps after such a long time!
I would like to have a set of dummy addresses as flag values that can never be a valid pointer.
For example, if I knew that pointers 0xffff0000 through 0xffffffff were always invalid, I could do something like this in C:
#include <stdlib.h>

/* Sentinel "pointers": addresses assumed never to be valid allocations. */
#define SIZE_TOO_SMALL ((char *)0xffff0001)
#define SIZE_TOO_LARGE ((char *)0xffff0002)
#define SIZE_EVEN      ((char *)0xffff0003)

char *allocate_odd_array(int size) {
    if (size % 2 == 0)
        return SIZE_EVEN;
    if (size < 100)
        return SIZE_TOO_SMALL;
    if (size > 1000)
        return SIZE_TOO_LARGE;
    return malloc(size);
}
A silly example, but potentially powerful since it removes the need to send an extra flag variable. One way I could do this is to allocate a few bytes myself and use those addresses as flags, but that comes with a small memory cost for each unique flag I use.
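For what it's worth, that self-allocated variant can be as cheap as one static byte per flag. A rough sketch (all names made up for illustration):
#include <stdlib.h>

/* One static byte per flag; each address is unique program-wide and can
   never be returned by malloc(), so it can serve as a sentinel pointer. */
static char size_too_small_flag, size_too_large_flag, size_even_flag;
#define SIZE_TOO_SMALL (&size_too_small_flag)
#define SIZE_TOO_LARGE (&size_too_large_flag)
#define SIZE_EVEN      (&size_even_flag)

char *allocate_odd_array(int size) {
    if (size % 2 == 0) return SIZE_EVEN;
    if (size < 100)    return SIZE_TOO_SMALL;
    if (size > 1000)   return SIZE_TOO_LARGE;
    return malloc(size);
}
Callers compare the returned pointer against the flag macros before dereferencing it, and the total cost is one byte per distinct flag.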
I don't expect a portable solution, but is there any guarantee on Windows, Linux, or macOS that the addressable space will not include certain values?
For Windows I have found this article, which says that on 32-bit systems the virtual address space is 0x00000000 to 0x7fffffff, and on 64-bit systems it is 0x0000000000000000 to 0x00007fffffffffff. I am not sure if other addresses have any reserved meaning, but they ought to be safe for this use case.
For Linux the answer seems a bit more complicated because (like everything else in Linux) it is configurable. This answer on Unix SE shows how memory is divided between the kernel and user space. 0000_8000_0000_0000 to ffff_7fff_ffff_ffff is listed as non-canonical, which I think means it should never be used. The kernel space (ffff_8000_0000_0000 to ffff_ffff_ffff_ffff) seems like it ought to be safe as well, but I'm less sure whether a system function could ever return such a pointer.
On Mac OS I've found this article, which puts the virtual memory range at 0 to 0x0007_FFFF_FFFF_F000 (64-bit) or 0 to 0xFFFF_F000 (32-bit), so addresses outside of these ranges should be fine.
It seems there is a little bit of overlap between all of the unused regions, so if you wanted to target all three platforms with the same address it would be possible. I'm still not 100% confident that these addresses are really safe to use on the respective OSes, so I'm holding out for anyone more knowledgeable to chime in.
According to the clGetKernelWorkGroupInfo documentation (from here), I tried to query the work-group size and private memory size used by my kernel. I tested the snippet below on an Android device with an Adreno 530 GPU.
(Code sample from Apple OpenCL tutorial)
size_t maxWorkGroupSize;
cl_ulong private_mem_used;
clGetKernelWorkGroupInfo(kernel, &device, CL_KERNEL_WORK_GROUP_SIZE, sizeof(maxWorkGroupSize), &maxWorkGroupSize, NULL );
clGetKernelWorkGroupInfo(kernel, &device, CL_KERNEL_PRIVATE_MEM_SIZE, sizeof(private_mem_used), &private_mem_used, NULL );
printf("Max work-group size is %ld \n", maxWorkGroupSize);
printf("Private memory used is %lld KB\n", private_mem_used/1024);
Output:
Max work-group size is 42773336
Private memory used is 179412930700111 KB
The output does not seem to be correct.
If the output is not correct, is there anything wrong with the snippet?
If the output is correct, it would be helpful if you could help me interpret it.
Your problem with wrong values seems to be resolved in the comment of user #pmdj.
I'm referring here to why you may seemingly always get the value 0 returned for CL_KERNEL_PRIVATE_MEM_SIZE.
The thing is, the values returned for CL_KERNEL_PRIVATE_MEM_SIZE vary between platforms.
On some platforms it returns the amount of private memory used, i.e. the number of bytes needed to store all of the kernel's variables in registers. Note that the compiler performs optimizations, so this does not have to equal the sum of your variables' sizes.
On other platforms it returns the amount of private memory spilled. If you use too many variables and exceed the size of the register file, the compiler has to spill them into caches or global memory. You can monitor this value as you make changes to your kernel: if it starts spilling, your kernel will likely become slower, often much slower. The amount spilled may be reported in bytes or as a number of registers, which are typically 32-bit words.
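I can't see the referenced comment here, but two things in the snippet would by themselves produce garbage numbers: clGetKernelWorkGroupInfo takes the cl_device_id itself as its second argument, not a pointer to it (so the call most likely fails and leaves the output variables uninitialized), and the printf format specifiers don't match size_t and cl_ulong. A corrected sketch:
size_t maxWorkGroupSize;
cl_ulong private_mem_used;

/* Second argument is the cl_device_id itself, not its address. */
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(maxWorkGroupSize), &maxWorkGroupSize, NULL);
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PRIVATE_MEM_SIZE,
                         sizeof(private_mem_used), &private_mem_used, NULL);

printf("Max work-group size is %zu\n", maxWorkGroupSize);
printf("Private memory used is %llu bytes\n",
       (unsigned long long)private_mem_used);
Checking the return codes of both calls would also have flagged the invalid device argument immediately.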
I am wondering how to choose optimal local and global work sizes for different devices in OpenCL.
Is there any universal rule for AMD, NVIDIA and Intel GPUs?
Should I analyze the physical build of the devices (number of multiprocessors, number of streaming processors per multiprocessor, etc.)?
Does it depend on the algorithm/implementation? I ask because I have seen that some libraries (like ViennaCL) simply test many combinations of local/global work sizes and choose the best one.
NVIDIA recommends that your (local) work-group size be a multiple of 32 (equal to one warp, which is their atomic unit of execution, meaning that 32 threads/work-items are scheduled together). AMD, on the other hand, recommends a multiple of 64 (equal to one wavefront). I'm unsure about Intel, but you can find this type of information in their documentation.
So say you are doing some computation and you have 2300 work-items (the global size); 2300 is not divisible by 64 or 32. If you don't specify the local size, OpenCL may choose a bad local size for you. When the local size is not a multiple of the atomic unit of execution you get idle threads, which leads to poor device utilization. Thus, it can be beneficial to add some "dummy" threads so that the global size becomes a multiple of 32/64 and then use a local size of 32/64 (the global size has to be divisible by the local size). For 2300 you can add 4 dummy threads/work-items, because 2304 is divisible by 32. In the actual kernel, you can write something like:
int globalID = get_global_id(0);
if (globalID >= realNumberOfThreads)
    globalID = 0;
This will make the four extra threads do the same work as thread 0 (it is often faster to do some extra work than to have many idle threads).
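On the host side, the padding is just rounding the global size up to the next multiple of the local size, e.g. (a sketch; queue and kernel are assumed to already exist):
size_t local = 32;                     /* or 64 for AMD wavefronts */
size_t realNumberOfThreads = 2300;
/* Round up to the next multiple of the local size: 2300 -> 2304. */
size_t global = ((realNumberOfThreads + local - 1) / local) * local;

clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);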
Hope that answered your question. GL HF!
If your processing essentially uses little memory (e.g. little kernel private state), you can choose the most intuitive global size for your problem and let OpenCL choose the local size for you.
See my answer here : https://stackoverflow.com/a/13762847/145757
If memory management is a central part of your algorithm and will have a great impact on performance, you should indeed go a little further and first check the maximum local size (which depends on the local/private memory usage of your kernel) using clGetKernelWorkGroupInfo, which in turn will determine your global size.
I have noticed a number of kernel sources that look like this (found randomly by Googling):
__kernel void fill(__global float* array, unsigned int arrayLength, float val)
{
    if(get_global_id(0) < arrayLength)
    {
        array[get_global_id(0)] = val;
    }
}
My question is whether that if-statement is actually necessary (assuming that "arrayLength" in this example is the same as the global work size).
In some of the more "professional" kernels I have seen, it is not present. It also seems to me that the hardware would do well to not assign kernels to nonsense coordinates.
However, I also know that processors work in groups. Hence, I can imagine that some processors in a group must do nothing (for example, if you have 1 group of size 16 and a work size of 41, then the group would process the first 16 work items, then the next 16, then the next 9, with 7 processors doing nothing--do they get dummy kernels?).
I checked the spec., and the only relevant mention of "get_global_id" is the same as the online documentation, which reads:
The global work-item ID specifies the work-item ID based on the number of global work-items specified to execute the kernel.
. . . based how?
So what is it? Is it safe to omit iff the array's size is a multiple of the work group size? What?
You have the right answer already, I think. If the global size of your kernel execution is the same as the array length, then this if statement is useless.
In general, that type of check is only needed for cases where you've partitioned your data in such a way that you know you might execute extra work items relative to your array size. In my experience, you can almost always avoid such cases.
I'm taking my first steps in OpenCL (and CUDA) for my internship. All nice and well, I now have working OpenCL code, but the computation times are way too high, I think. My guess is that I'm doing too much I/O, but I don't know where that could be.
The code for the main is here: http://pastebin.com/i4A6kPfn, and for the kernel here: http://pastebin.com/Wefrqifh. I start measuring time after segmentPunten(segmentArray, begin, eind); has returned, and I stop measuring time after the last clEnqueueReadBuffer.
Computation time on an NVIDIA GT440 is 38.6 seconds, on a GT555M 35.5 seconds, on an Athlon II X4 5.6 seconds, and on an Intel P8600 6 seconds.
Can someone explain this to me? Why are the computation times so high, and what solutions are there for this?
What is it supposed to do: (short version) calculate how much noise load is produced by an airplane passing by.
Long version: there are several Observer Points (OPs), which are the points at which the sound of a passing airplane is measured. The flight path is segmented into 10,000 segments; this is done in the function segmentPunten. The double for-loop in main gives each OP a coordinate. There are two kernels. The first one calculates the distance from a single OP to a single segment; this is saved in the array "afstanden". The second kernel calculates the sound load in an OP from all the segments.
Just eyeballing your kernel, I see this:
kernel void SEL(global const float *afstanden, global double *totaalSEL,
                const int aantalSegmenten)
{
    // ...
    for(i = 0; i < aantalSegmenten; i++) {
        double distance = afstanden[threadID * aantalSegmenten + i];
        // ...
    }
    // ...
}
It looks like aantalSegmenten is being set to 1000. You have a loop in each kernel that accesses global memory 1000 times. Without crawling through the code, I'm guessing that many of these accesses overlap when considering your computation as a whole. Is this the case? Will two work-items access the same global memory? If so, you will see a potentially huge win on the GPU from rewriting your algorithm to partition the work such that each global memory location is read only once and saved in local memory. After that, every work-item in the work-group that needs that location can read it quickly.
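For illustration only, here is a rough sketch of that pattern. It is not your actual algorithm: TILE, the staging buffer and the plain summation are made up, and it assumes a work-group size of TILE and that (as guessed above) the work-items of a group want the same stretch of afstanden.
#define TILE 64

kernel void SEL(global const float *afstanden, global double *totaalSEL,
                const int aantalSegmenten)
{
    local float tile[TILE];
    const int lid = get_local_id(0);
    double sum = 0.0;

    for (int base = 0; base < aantalSegmenten; base += TILE) {
        /* One cooperative, coalesced load per work-group... */
        if (base + lid < aantalSegmenten)
            tile[lid] = afstanden[base + lid];
        barrier(CLK_LOCAL_MEM_FENCE);

        /* ...then every work-item reads the values from fast local memory. */
        const int count = min(TILE, aantalSegmenten - base);
        for (int i = 0; i < count; i++)
            sum += tile[i];              /* replace with the real math */
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    totaalSEL[get_global_id(0)] = sum;
}
The key point is that each element of afstanden is loaded from global memory once per work-group instead of once per work-item.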
As an aside, the CL specification allows you to omit the leading __ from CL keywords like global and kernel. I don't think many newcomers to CL realize that.
Before optimizing further, you should first get an understanding of what is taking all that time. Is it the kernel compiles, data transfer, or actual kernel execution?
As mentioned above, you can get rid of the kernel compiles by caching the results. I believe some OpenCL implementations (the Apple one at least) already do this automatically. With others, you may need to do the caching manually. Here are instructions for the caching.
If the performance bottleneck is the kernel itself, you can probably get a major speed-up by organizing the 'afstanden' array lookups differently. Currently, when a block of threads performs a read from memory, the addresses are spread out through memory, which is a real killer for GPU performance. Ideally you would index the array with something like afstanden[ndx*NUM_THREADS + threadID], which makes the accesses from a work-group load a contiguous block of memory. This is much faster than the current, essentially random, memory lookup.
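In other words, with threadID = get_global_id(0) and ndx as the loop counter, the two layouts compare like this (a sketch; the second form assumes the host also fills afstanden in the transposed order):
/* Current (strided) layout: at a fixed ndx, neighbouring work-items read
   addresses aantalSegmenten floats apart, so the loads cannot be coalesced. */
float d_strided = afstanden[threadID * aantalSegmenten + ndx];

/* Transposed layout: at a fixed ndx, neighbouring work-items read
   neighbouring addresses, which the GPU coalesces into far fewer
   memory transactions. */
float d_coalesced = afstanden[ndx * NUM_THREADS + threadID];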
First of all, you are not measuring the computation time but the whole kernel read-in/compile/execute mumbo-jumbo. To make a fair comparison, measure the computation time from the first "non-static" part of your program (for example, from the first clSetKernelArg call to the last clEnqueueReadBuffer).
If the execution time is still too high, you can use a profiler (such as the Visual Profiler from NVIDIA) and read the OpenCL Best Practices guide, which is included in the CUDA Toolkit documentation.
Regarding the raw kernel execution time: consider (and measure) whether you really need double precision for your calculation, because double-precision calculations are artificially slowed down on consumer-grade NVIDIA cards.
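One way to isolate the raw kernel execution time is OpenCL's built-in event profiling (a sketch; error checking is omitted, and context, device, kernel, global and local are assumed to already exist):
cl_int err;
/* The queue must be created with profiling enabled. */
cl_command_queue queue =
    clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, &err);

cl_event ev;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, &ev);
clWaitForEvents(1, &ev);

cl_ulong start, end;  /* timestamps in nanoseconds */
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END, sizeof(end), &end, NULL);
printf("Kernel time: %.3f ms\n", (end - start) * 1e-6);
clReleaseEvent(ev);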