I was working on a kernel that made many global memory accesses per thread, so I copied the data to local memory, which gave a 40% speedup.
I wanted more speedup, so I copied the data from local to private memory, which degraded performance.
So am I right to think that using too much private memory can degrade performance?

Ashwin's answer is in the right direction but a little misleading.
OpenCL abstracts the address space of variables away from their physical storage, and there is not necessarily a 1:1 mapping between the two.
Consider OpenCL variables declared in the __private address space, which includes automatic non-pointer variables inside functions by default. The NVidia GPU implementation will physically allocate these in registers as far as possible, only spilling over to physical off-chip memory when there is insufficient register capacity. This particular off-chip memory is called "CUDA local" memory, and has similar performance characteristics to memory allocated for __global variables, which explains the performance penalty due to register spill-over. There is no such physical thing as "private memory" in this implementation, only a "private address space", which may be allocated on- or off-chip.
The performance hit is not a direct consequence of using the private address space (or "private memory"), which is typically allocated in high performance memory. It is because, under this implementation, the variable was too large to be allocated on high performance registers, and was therefore "spilled over" to off-chip memory.

According to "Heterogeneous Computing with OpenCL" (Revised OpenCL 1.2 Edition):
Private memory is memory that is unique to an individual work-item. Local variables
and nonpointer kernel arguments are private by default. In practice, these variables
are usually mapped to registers, although private arrays and any spilled
registers are usually mapped to an off-chip (i.e., long-latency) memory.
So, if you use a great deal of private memory, or use arrays in private memory, yes, it can be slower than local memory.

In (GPU-like) OpenCL devices, the local memory is on-chip and close to the processing elements (PE). It might be as fast as accessing L1 cache. The private memory for each thread is actually apportioned from off-chip global memory. This is far from the PE and might have a latency of hundreds of clock cycles, thus degrading read-write performance.

James Beilby's answer is in the right direction but slightly off the mark:
Depending on the implementation, private memory could be faster or slower, because OpenCL doesn't force vendors to use on-chip or off-chip memory. AMD is very good at OpenCL in terms of price/performance, so I'll give some numbers for it.
Private memory in AMD's implementation is the fastest (lowest latency, highest bandwidth, around 22 TB/s for a mainstream GPU).
Here in appendix-d:
you can see the register file, LDS, constant cache, and global memory, which back the different address spaces when there is enough room in them. For example, the register file has 22 TB/s of bandwidth and only about 300 kB per compute unit. It has lower latency and more bandwidth than the LDS, which backs the __local address space. The total LDS size is even smaller than that (per compute unit).
If going from local to private doesn't help, you should decrease the local work-group size, for example from 256 to 64, so that more private registers are available per thread.
So for this example AMD GPU, local memory is 15 times faster than global memory, and private memory is 5 times faster than local memory. If the data doesn't fit in private memory, it spills to global memory, so only the L1 and L2 caches can help. If the data is not reused much, there is no point in using private registers. If it is used only once, just stream from global to global.
On some smartphones or on a CPU, using private registers could be very bad because they could be mapped to something else.


Allocate a constant memory variable in local memory, only once, shared within its workgroup

I have an OpenCL application whose kernels all share two big chunks of constant memory. One of them is used to generate passwords, the other to test them.
The two subprograms are very fast when operating separately, but things slow to a crawl when I run them one after the other (I get one quarter of the performance I would usually get).
I believe this is because the subroutine testing the passwords has a huge (10k) lookup table for AES decryption, and this isn't shared between multiple kernels running at the same time within the same workgroup.
I know it isn't shared because the AES lookup table is allocated as __local inside every single kernel and then initialised copying the values from an external library (as in, the kernel creates a local copy of the static memory and uses that).
I've tried changing the __local allocation/initialization to a __constant variable, a pointer pointing to the library's constant memory, but this gets me a 10x performance reduction.
I can't make any sense of this. What should I do to make sure my constant memory is allocated only once per work group, and every kernel working in the same workgroup can share read operations there?
__constant memory by definition is shared by all work groups, so I would expect that in any reasonable implementation it is only allocated on the compute device once per kernel already.
On the other hand if you have two separate kernels that you are enqueueing back-to-back, I can't think of a reasonable way to guarantee that some __constant memory is shared or preserved on the device for both. If you want to be reasonably sure that some buffer is copied once to the compute device for use by both subroutines, then the subroutines should be a part of the same kernel.
In general, performance will depend on the underlying hardware & OpenCL implementation, and it will not be portable across different devices. You should see if there's an OpenCL performance guide for the hardware you are using.
As for why __constant memory may be slower than __local memory, again it depends on the hardware and how the OpenCL implementation maps address spaces to memory locations on the hardware. Your mistake is in assuming that __constant memory will be faster because it is, by definition, constant (read-only). Where the memory is on the device dictates how fast it is (i.e. a fast per-work-group buffer vs. a slower buffer shared by all work groups on the device), and the OpenCL address space is only one factor in how and where the OpenCL implementation allocates memory. (Size matters too, and it's conceivable that if your __constant memory is small enough it will be "promoted" to faster per-work-group memory, but that depends entirely on the implementation.)
If __local memory is faster as you say, then you might consider splitting up your work into work-group-sized chunks and passing in only that part of the table required by a work group to a __local buffer as a kernel parameter.

OpenCL : Id of the physical core being used

I'm trying to get something to work, but I've run out of ideas, so I figured I would ask here.
I have a kernel that has a large global size (usually 5 million).
Each of the threads can require up to 1 MB of global memory (exact size not known in advance).
So I figured: on my typical target GPU I have 6 GB and I can run 2880 threads in parallel, more than enough, right?
My idea is to create a big buffer (actually two, because of the maximum buffer size limitation).
Each thread would point to a specific global memory area (with coalescing and so on, but you get the idea).
My problem is: how do I know which thread is currently being run (in the kernel code), so that I can point to the right memory area?
I did find the cl_arm_get_core_id extension, but this only gives me the workgroup, not the actual thread being used. It also does not seem to be available on all GPUs, since it's an extension.
I have the option to have work_group_size = nb_compute_units / nb_cores and have the offset to be arm_get_core_id() * work_group_size + global_id() % work_group_size
But maybe this group size is not optimal, and the portability issue still exists.
I can also enqueue a lot of kernel calls with a global size of 2880, in which case I obviously know where to point to from the global ID.
But won't this lead to a lot of overhead because of the 5 million / 2880 kernel calls? Also, any work group that finishes before the others will be idle until all work groups for that call have finished their job.
Any ideas to do this properly are very welcome !
Well, you are storing 1 MB per WI for temporary computations (temporary, because you are not saving them; otherwise you wouldn't have enough memory).
Then why not simply let it spill to global memory? Does the compiler complain? If it does complain, then you need other approaches:
One possibility is to create a queue (just a boolean array) of the memory zones that are free for use by the work groups. Every time a new work group is launched, it takes an empty slot and sets the boolean to the "used" state. You can do this with the atomic_cmpxchg() atomic operation.
It may introduce a small overhead when launching each WG, but that would probably be negligible if each WI needs 1 MB of global memory.
Here you have a small example of how to use atomic_cmpxchg(): LINK
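The slot-claiming idea can be sketched on the host side in plain C11, which makes it easy to try out without a device. This is only an analogue, under stated assumptions: NUM_SLOTS, claim_slot(), release_slot() and slot_offset() are hypothetical names, and C11's atomic_compare_exchange_strong() stands in for the OpenCL C built-in. Inside a kernel, the corresponding test would be atomic_cmpxchg(&slot_used[i], 0, 1) == 0 on an int array in __global memory, since atomic_cmpxchg() returns the old value.

```c
#include <stdatomic.h>
#include <stddef.h>

#define NUM_SLOTS 4  /* hypothetical: one scratch region per concurrently running work group */

/* 0 = free, 1 = in use; plays the role of the "boolean array" from the answer. */
static atomic_int slot_used[NUM_SLOTS];

/* Try to claim a free slot. Returns its index, or -1 if every slot is taken. */
int claim_slot(void)
{
    for (int i = 0; i < NUM_SLOTS; ++i) {
        int expected = 0;
        /* Succeeds only if slot i still holds 0, and sets it to 1 atomically. */
        if (atomic_compare_exchange_strong(&slot_used[i], &expected, 1))
            return i;
    }
    return -1;
}

/* Give the slot back once its scratch region is no longer needed. */
void release_slot(int i)
{
    atomic_store(&slot_used[i], 0);
}

/* Byte offset of slot i inside the big scratch buffer. */
size_t slot_offset(int i, size_t bytes_per_slot)
{
    return (size_t)i * bytes_per_slot;
}
```

In a kernel, one work-item per work group would typically claim the slot and share the index with the rest of the group through a __local variable followed by a barrier. A caller that gets -1 has to retry or fall back, which is a design choice this sketch leaves open.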

Is there a maximum limit to private memory in OpenCL?

Does the OpenCL specification set any maximum limit on the amount of private memory that can be used? If so, how do I get this number?
I have a function which gives the correct result when run outside OpenCL, but when converted to a kernel, it spews out garbage. I checked the amount of private memory being used per work item using the CL_KERNEL_PRIVATE_MEM_SIZE flag and it is ~ 4000 bytes. I suspect that I am using too much private memory and this is somehow leading to junk computation.
According to the OpenCL spec, the location and size of private memory are not defined, i.e. they are left for the vendor to decide. That raises the question of how much you should use. Used correctly, it gives the best performance; used incorrectly, it can become the cause of a slowdown.
You can use AMD's CodeXL or NVIDIA's Nsight (if you have AMD or NVIDIA cards) to analyze the kernel's memory usage. With a little hands-on practice, these tools let you see register spilling.
I don't think that high usage of private memory will lead to junk results; it could well be an issue in your code.
It's different for different architectures. For example, an HD 7870's private memory per compute unit is 256 kB, and if your setting is 64 threads per compute unit, then each thread will have 4 kB of private memory, which means about 1000 float values. If you increase the number of threads per compute unit further, the private memory per thread will drop to the 1 kB range. You should add some local memory usage to balance it.
More importantly, you cannot use all of it. The compiler uses a big portion for its own optimizations and for other things that I don't know about. You can never be sure without a profiler.
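The per-thread budget quoted above is a simple division, sketched here in C. The function names are hypothetical, and the even split across resident work-items is an assumption: as the answer says, the compiler reserves part of the register file, so real budgets are smaller.

```c
#include <stddef.h>

/* Bytes of register file per work-item if one compute unit's register file
   is divided evenly among its resident work-items (an idealized assumption). */
size_t private_bytes_per_item(size_t regfile_bytes_per_cu, size_t items_per_cu)
{
    return regfile_bytes_per_cu / items_per_cu;
}

/* The same budget expressed as a count of float values. */
size_t private_floats_per_item(size_t regfile_bytes_per_cu, size_t items_per_cu)
{
    return private_bytes_per_item(regfile_bytes_per_cu, items_per_cu) / sizeof(float);
}
```

With the HD 7870 figures from the answer (256 kB per compute unit), 64 work-items give 4096 bytes each, i.e. 1024 four-byte floats, and 256 work-items give 1024 bytes each.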
There isn't a theoretical limit for private memory (unlike local memory). If there was, clGetDeviceInfo would list it (it doesn't). However, I know there are practical limits. For example, some GPU implementations will try and store private memory in the register file if it fits. If you exceed this, it spills out to main memory and may be orders of magnitude more expensive. Regardless, the result should be correct (just achieved much slower). It should not junk your computation.

Will replicating Global Data on an OpenCL compliant device improve performance?

I have a pretty small dataset, but large enough that it won't fit in the workspace or private memories in any GPU currently on the market. What this means is that each kernel must access the data in the global memory on the GPU. If I replicate this data to multiple copies in the global memory, can it increase performance/reduce latency, or is the memory controller restrictive and will only allow one core to access the global memory at a time? If this is device specific, are there any models which have this feature?
This is very much bound by the memory controller of the video card, and multiple copies of the same data won't help you. I am unaware of a gpu having more than one memory controller for global access.
Your memory access pattern will greatly affect the overall throughput of your kernel. Do you have a specific example/kernel that you need optimized?

Is there a limit to OpenCL local memory?

Today I added four more __local variables to my kernel to dump intermediate results in. But just adding the four more variables to the kernel's signature and adding the corresponding Kernel arguments renders all output of the kernel to "0"s. None of the cl functions returns an error code.
I further tried only to add one of the two smaller variables. If I add only one of them, it works, but if I add both of them, it breaks down.
So could this behavior of OpenCL mean that I allocated too much __local memory? How do I find out how much __local memory I can use?
The amount of local memory which a device offers on each of its compute units can be queried by using the CL_DEVICE_LOCAL_MEM_SIZE flag with the clGetDeviceInfo function:
cl_ulong size;
cl_int err = clGetDeviceInfo(deviceID, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(size), &size, NULL);
/* on success err == CL_SUCCESS and size holds the limit */
The size returned is in bytes. Each work-group can allocate this much memory strictly for itself. Note, however, that if it does allocate the maximum, this may prevent other work-groups from being scheduled concurrently on the same compute unit.
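The trade-off in the last sentence can be made concrete with a small calculation, sketched here in plain C. The function names are hypothetical, and the result is only an upper bound that considers local memory alone; registers and other per-compute-unit limits can reduce it further. The per-kernel local memory usage to feed in can be queried with clGetKernelWorkGroupInfo and CL_KERNEL_LOCAL_MEM_SIZE.

```c
#include <stddef.h>
#include <stdint.h>

/* 1 if a work-group requesting `requested` bytes of __local memory fits
   within the device limit (CL_DEVICE_LOCAL_MEM_SIZE), 0 otherwise. */
int local_mem_fits(size_t device_local_bytes, size_t requested)
{
    return requested <= device_local_bytes;
}

/* Upper bound on the work-groups that can be resident on one compute unit,
   looking at local memory only. SIZE_MAX means "not limited by local memory". */
size_t max_groups_by_local_mem(size_t device_local_bytes, size_t requested)
{
    if (requested == 0)
        return SIZE_MAX;
    return device_local_bytes / requested;
}
```

For example, on a device reporting 32768 bytes, a kernel that uses all 32768 bytes leaves room for one work-group per compute unit, while one that uses 8192 bytes leaves room for up to four.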
Of course there is, since local memory is physical rather than virtual.
We are used, from working with a virtual address space on CPUs, to theoretically have as much memory as we want - potentially failing at very large sizes due to paging file / swap partition running out, or maybe not even that, until we actually try to use too much memory so that it can't be mapped to the physical RAM and the disk.
This is not the case for things like a computer's OS kernel (or lower-level parts of it) which need to access specific areas in the actual RAM.
It is also not the case for GPU global and local memory. There is no* memory paging (remapping of perceived thread addresses to physical memory addresses), and no swapping. Specifically regarding local memory, every compute unit (= every streaming multiprocessor on an NVIDIA GPU) has a block of on-chip RAM used as local memory.
The size of each such block is what you get with
clGetDeviceInfo( · , CL_DEVICE_LOCAL_MEM_SIZE, · , · , · ).
To illustrate, on NVIDIA Kepler GPUs, the local memory size is either 16 KBytes or 48 KBytes (and the complement to 64 KBytes is used for caching accesses to global memory). So, as of today, GPU local memory is very small relative to the global device memory.
* On NVIDIA GPUs beginning with the Pascal architecture, paging is supported, but that is not the common way of using device memory.
