Most parallel reduction algorithms use shared (local) memory: Nvidia's, AMD's, Intel's, and so on.
But what if a device doesn't have shared (local) memory? How can I do the reduction then?
If I use the same algorithms but store the temporary values in global memory, will that work fine?
If the device supports OpenCL 2.0, then work_group_reduce can be used:
gentype work_group_reduce_<op>(gentype x)
The <op> in work_group_reduce_<op>, work_group_scan_exclusive_<op> and work_group_scan_inclusive_<op> defines the operator and can be add, min or max.
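For example, a full reduction then collapses to a single built-in call, with no local memory declared at all. A minimal sketch (kernel and buffer names are illustrative, and the per-group partial sums still have to be combined in a second pass or on the host):

// Requires building the program with -cl-std=CL2.0.
__kernel void reduce_sum(__global const float* in,
                         __global float* partial)
{
    float v = in[get_global_id(0)];
    float sum = work_group_reduce_add(v);  // built-in work-group reduction
    if (get_local_id(0) == 0)
        partial[get_group_id(0)] = sum;    // one partial result per work-group
}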
Thinking about it, my comment was already the complete answer.
Yes, you can use global memory as a replacement for local memory, but:
you have to allocate enough global memory for all work-groups and assign each work-group its own chunk of that memory (with local memory you only specify as much memory as a single work-group needs, and each work-group allocates that amount for itself)
you have to use CLK_GLOBAL_MEM_FENCE instead of CLK_LOCAL_MEM_FENCE
you will lose a significant amount of performance
If I have time this evening, I will post a simple example.
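In the meantime, here is a minimal, untested sketch of the idea (buffer names are illustrative; the host must allocate scratch with one float per work-item across all work-groups, and the local size is assumed to be a power of two):

__kernel void reduce_global(__global const float* in,
                            __global float* scratch,   // one float per work-item
                            __global float* partial)
{
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);
    scratch[gid] = in[gid];
    barrier(CLK_GLOBAL_MEM_FENCE);   // global fence instead of CLK_LOCAL_MEM_FENCE
    // Standard tree reduction, but on each work-group's chunk of global memory.
    for (size_t s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[gid] += scratch[gid + s];
        barrier(CLK_GLOBAL_MEM_FENCE);
    }
    if (lid == 0)
        partial[get_group_id(0)] = scratch[gid];  // per-work-group result
}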
I want to generate a matrix which will be read by many threads after its generation, so I declared it at program scope. It has to be constant, and I am only assigning the values once.
1) Why does OpenCL ask for initialization at the declaration itself?
2) How can I fix this issue?
1) Because you can't tell the GPU which elements are written by which threads. Constants are prepared by the preprocessor using the scalar engine, not the parallel one. The parallel engine would need N x N synchronizations to achieve that, where N is the number of threads participating in building the constant buffer.
2-a) If you want to work with constant memory, prepare a simple (__global, not __constant) buffer in one kernel, then use it as a constant buffer in the next kernel (the engine puts it in the constant memory space); a sketch of this follows after the list. The constant space is small, though, so the matrix has to be small too. This needs two kernels, which means kernel launch overhead.
2-b) If cache performance is enough, just use a plain buffer. Then it can all happen in a single kernel: the first thread group prepares the matrix, and the remaining ones compute with it, not starting until the first group gives a signal using atomic functions.
2-c) If local memory is bigger than constant memory, you can use local memory and have each compute unit build the matrix for itself. This should take the same number of cycles (maybe even fewer if you use all cores) and will probably be faster than constant memory. It also needs no communication between thread groups, so it would be fast.
2-d) If the matrix is big and you need most of the bandwidth, distribute it across all memory spaces. Example: put 1/4 of the matrix in constant memory (5x bandwidth), 1/4 in local memory (10x bandwidth), 1/4 in global memory (2x from cache performance), and the remaining data in the instruction space (the instructions themselves), so that multiple threads work on four different places concurrently, using all the bandwidth (constant + local + cache + instruction cache).
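For option 2-a, the two kernels could look roughly like this (a hedged sketch; the names and the fill/compute logic are placeholders, and the host simply passes the same buffer to both kernels):

__kernel void build_matrix(__global float* m, int n)
{
    int i = get_global_id(0);
    if (i < n * n)
        m[i] = (float)i;          // placeholder initialization, done in parallel
}

__kernel void use_matrix(__constant float* m,  // same buffer, read via constant space
                         __global float* out, int n)
{
    int i = get_global_id(0);
    if (i < n * n)
        out[i] = 2.0f * m[i];     // placeholder computation
}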
I am using two GPUs of the same configuration for my HPC GPGPU calculations with OpenCL. One of the cards is connected for display purposes, and about 200-300 MB of its memory is always used by two programs, compiz and the X server. My question is: when using these GPUs for computation, I can use only part of the total memory on the GPU that drives the display, whereas on the second GPU I can use the entire global memory. In my case the cards are two Nvidia Quadro 410s, each with 192 CUDA cores and 512 MB of memory (503 MB usable). On the display GPU I can use only 128 MB for computation, while on the other I can use the full 503 MB.
According to the OpenCL Specification, page 32:
Max size of memory object allocation in bytes. The minimum value is max(1/4th of CL_DEVICE_GLOBAL_MEM_SIZE, 128*1024*1024)
Also, shouldn't this hold for all the GPUs present in the system?
Just continue reading from that point and you will see:
Max size of memory object allocation in bytes. The minimum value is max(1/4th of CL_DEVICE_GLOBAL_MEM_SIZE, 128*1024*1024)
so whichever is greater, 128 MB or 1/4 of the total memory, is the smallest value the limit can have.
OpenCL will automatically swap data out of the GPU, so you are not actually limited to the GPU's global memory; you can have more memory in use in total, as long as you don't use it all at once. But you "obviously" cannot create single objects so big that they don't fit in the GPU memory. That is where this limit kicks in.
The current maximum limit per object is, as pointed out by @huseyin:
CL_DEVICE_MAX_MEM_ALLOC_SIZE (cl_ulong)
Max size of memory object allocation in bytes. The minimum value is max(1/4th of CL_DEVICE_GLOBAL_MEM_SIZE, 128*1024*1024)
minimum_max_alloc_size = max(1/4 * CL_DEVICE_GLOBAL_MEM_SIZE, 128 MB)
If you read it carefully, that is the minimum value for the maximum allocation size (tricky wording!).
Probably nVIDIA sets it to 1/4 of the memory on a display GPU and to the whole memory size on a non-display GPU. By doing so, nVIDIA follows the spec in both cases.
It is something you should query, and you should operate within the limits the API reports. You cannot change it, and you should not guess it.
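Querying it is only a few lines of host code. A minimal sketch (error checking omitted; device is assumed to be a valid cl_device_id obtained via clGetDeviceIDs):

#include <CL/cl.h>
#include <stdio.h>

void print_alloc_limit(cl_device_id device)
{
    cl_ulong global_size = 0, max_alloc = 0;
    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                    sizeof(global_size), &global_size, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE,
                    sizeof(max_alloc), &max_alloc, NULL);
    printf("global memory: %llu MB, max single allocation: %llu MB\n",
           (unsigned long long)(global_size >> 20),
           (unsigned long long)(max_alloc >> 20));
}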
For an array X in global memory, I need to write two values in every kernel execution:
X[p]=mul1+mul2;
X[p+a]=mul1-mul2;
Here 'a' can range from 0 to very high values. I observed that these two writes slow down my kernel to a great extent.
What is the best way to improve the memory write performance in OpenCL?
Are coalesced memory writes possible only for intra-kernel writes?
Assuming p depends linearly on your thread ID, you are doing things the right way. You could try to pass X+a as a second argument to your kernel and do Y[p]=mul1-mul2; instead of X[p+a]=mul1-mul2;, but I doubt it will really be faster.
Concerning your second question: if you are thinking of having two kernels, one performing the addition and the other the subtraction, and launching them concurrently, you cannot be sure they will run side by side in parallel. Once again, I doubt it will be faster in the end.
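For completeness, the two-argument variant would look something like this (a hedged sketch; on the host, Y would be a sub-buffer of X created at offset a with clCreateSubBuffer, which requires the offset to respect CL_DEVICE_MEM_BASE_ADDR_ALIGN; all other names are illustrative):

__kernel void two_writes(__global const float* A,
                         __global const float* B,
                         __global float* X,
                         __global float* Y)   // sub-buffer of X at offset a
{
    size_t p = get_global_id(0);
    float mul1 = A[p];
    float mul2 = B[p];
    X[p] = mul1 + mul2;
    Y[p] = mul1 - mul2;   // same index p, different base pointer
}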
I am wondering how to choose optimal local and global work sizes for different devices in OpenCL.
Is there any universal rule for AMD, NVIDIA, and Intel GPUs?
Should I analyze the physical build of the devices (number of multiprocessors, number of streaming processors per multiprocessor, etc.)?
Does it depend on the algorithm/implementation? I ask because I saw that some libraries (like ViennaCL) simply test many combinations of local/global work sizes and choose the best one.
NVIDIA recommends that your (local) work-group size be a multiple of 32 (equal to one warp, which is their atomic unit of execution, meaning that 32 threads/work-items are scheduled together). AMD, on the other hand, recommends a multiple of 64 (equal to one wavefront). I'm unsure about Intel, but you can find this type of information in their documentation.
So say you are doing some computation with 2300 work-items (the global size); 2300 is divisible by neither 64 nor 32. If you don't specify the local size, OpenCL may choose a bad one for you. When the local size is not a multiple of the atomic unit of execution, you get idle threads, which leads to bad device utilization. It can therefore be beneficial to add some "dummy" threads so that the global size becomes a multiple of 32/64, and then use a local size of 32/64 (the global size has to be divisible by the local size). For 2300 you can add 4 dummy threads/work-items, because 2304 is divisible by 32. In the actual kernel, you can write something like:
int globalID = get_global_id(0);
if (globalID >= realNumberOfThreads)
    globalID = 0;   // padding threads redo thread 0's work instead of idling
This will make the four extra threads do the same work as thread 0 (it is often faster to do some extra work than to have many idle threads).
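On the host side, the padding is a one-liner (a minimal sketch; variable names are illustrative):

size_t local = 32;                                            /* or 64 on AMD */
size_t real_items = 2300;                                     /* actual amount of work */
size_t global = ((real_items + local - 1) / local) * local;   /* rounds up to 2304 */
/* pass real_items to the kernel as realNumberOfThreads */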
Hope that answered your question. GL HF!
If your processing essentially uses little memory (e.g. to store kernel private state), you can choose the most intuitive global size for your problem and let OpenCL choose the local size for you.
See my answer here : https://stackoverflow.com/a/13762847/145757
If memory management is a central part of your algorithm and will have a great impact on performance, you should indeed go a little further: first check the maximum local size (which depends on the local/private memory usage of your kernel) using clGetKernelWorkGroupInfo, and let that in turn determine your global size.
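A minimal sketch of that query (error checking omitted; kernel and device are assumed to be valid, already-created objects):

size_t max_wg = 0;
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(max_wg), &max_wg, NULL);
/* max_wg is the largest local size this kernel can use on this device */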
I am running functions which are deeply nested and consume quite a bit of memory as reported by the Windows task manager. The output variables are relatively small (1-2 orders of magnitude smaller than the amount of memory consumed), so I am assuming that the difference can be attributed to intermediate variables assigned somewhere in the function (or within sub-functions being called) and a delay in garbage collection. So, my questions are:
1) Is my assumption correct? Why or why not?
2) Is there any sense in simply nesting calls to functions more deeply rather than assigning intermediate variables? Will this reduce memory usage?
3) Suppose a scenario in which R is using 3GB of memory on a system with 4GB of RAM. After running gc(), it's now using only 2GB. In such a situation, is R smart enough to run garbage collection on its own if I had, say, called another function which used up 1.5GB of memory?
There are certain datasets I am working with which are able to crash the system as it runs out of memory when they are processed, and I'm trying to alleviate this. Thanks in advance for any answers!
Josh
1) Memory used to represent objects in R and memory marked by the OS as in-use are separated by several layers (R's own memory handling, when and how the OS reclaims memory from applications, etc.). I'd say (a) I don't know for sure but (b) at times the task manager's notion of memory use might not accurately reflect the memory actually in use by R, but that (c) yes, probably the discrepancy you describe reflects memory allocated by R to objects in your current session.
2) In a function like
f = function() { a = 1; g = function() a; g() }
invoking f() prints 1, implying that memory used by a is still being marked as in use when g is invoked. So nesting functions doesn't help with memory management, probably the reverse.
Your best bet is to clean up or re-use variables representing large allocations before making more large allocations. Appropriately designed functions can help with this, e.g.,
f = function() { m = matrix(0, 10000, 10000); 1 }  ## large allocation, only 1 is returned
g = function() { m = matrix(0, 10000, 10000); 1 }
h = function() { f(); g() }                        ## f's matrix is collectable before g allocates
The large memory of f is no longer needed by the time f returns, and so is available for garbage collection if the large memory required for g necessitates this.
3) If R tries to allocate memory for a variable and can't, it'll run its garbage collector and try again. So you don't gain anything by running gc() yourself.
I'd make sure that you've written memory-efficient code, and if there are still issues I'd move to a 64-bit platform where memory is less of an issue.
R has facilities for memory profiling, but R needs to be built with them enabled. While we enable that for Debian / Ubuntu, I do not know what the default for Windows is.
Usage of memory profiling is discussed (briefly) in the 'Writing R Extensions' manual.
Coping with (limited) memory on a 32-bit system (and particularly Windows) has its challenges. Most people will recommend that you switch to a system with as much RAM as possible running a 64-bit OS.