Difference between reading speed from memory created with CL_MEM_READ_WRITE and CL_MEM_READ_ONLY flags - opencl

In the first stage of my project I generate some vertices; in the second stage I read these vertices and create a connectivity array. For my vertices I have used CL_MEM_READ_WRITE. Will I get a performance increase if I use CL_MEM_WRITE_ONLY memory in the first stage and then copy it into separate CL_MEM_READ_ONLY memory for the second stage? I ask because each flag probably has its own optimizations for maximum performance.

The flag passed as the 2nd argument of clCreateBuffer only specifies how the kernel side can access the memory space.

Probably not. I expect the buffer copy to be far more costly than any optimization.
Also, I looked at the AMD APP OpenCL Programming Guide and I didn't find any indication about optimizations when using a READ_ONLY or WRITE_ONLY buffer.
My understanding is that the access flag is only used by the OpenCL runtime to decide when it needs to copy buffer data between the different memory spaces/areas.

Related

CL_MEM_USE_HOST_PTR Vs CL_MEM_COPY_HOST_PTR Vs CL_MEM_ALLOC_HOST_PTR

In the book OpenCL in Action I read this:
CL_MEM_USE_HOST_PTR: The memory object will access the memory region specified by the host pointer.
CL_MEM_COPY_HOST_PTR: The memory object will set the memory region specified by the host pointer.
CL_MEM_ALLOC_HOST_PTR: A region in host-accessible memory will be allocated for use in data transfer.
I am utterly confused about these three flags.
I would like to know at least how are the first two different.
1-In CL_MEM_USE_HOST_PTR Memory Object will access the memory region while in CL_MEM_COPY_HOST_PTR Memory Object will set the memory region (specified by host in both cases). How is this setting and accessing different ?
Then the third one is again confusing me a lot.
2- Are all of these pinned memory allocation?
CL_MEM_COPY_HOST_PTR simply copies the values at a time of creation of the buffer.
CL_MEM_USE_HOST_PTR maintains a reference to that memory area; depending on the implementation, it might access it directly while kernels are executing or it might cache it. You must map the buffer (clEnqueueMapBuffer) to provide synchronization points if you want to write cross-platform code using this.
CL_MEM_ALLOC_HOST_PTR is the only one that is often pinned memory. As an example, on AMD this one allocates a pinned memory area. Often, if you use CL_MEM_USE_HOST_PTR, the runtime will simply memcpy internally to a pinned memory area and use that. By using CL_MEM_ALLOC_HOST_PTR you avoid that copy. But again, this depends on the implementation, and you must read the manufacturer's documentation to find out whether this will give you pinned memory or not.

How to use async_work_group_copy in OpenCL?

I would like to understand how to correctly use the async_work_group_copy() call in OpenCL. Let's have a look on a simplified example:
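__kernel void test(__global float *x) {
    /* GROUP_SIZE must be defined at build time and equal the work-group size */
    __local float xcopy[GROUP_SIZE];
    int globalid = get_global_id(0);
    int localid = get_local_id(0);
    /* every work-item passes the same source address: the start of its group's chunk */
    event_t e = async_work_group_copy(xcopy, x + globalid - localid, GROUP_SIZE, 0);
    wait_group_events(1, &e);
}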
The reference http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/async_work_group_copy.html says "Perform an async copy of num_elements gentype elements from src to dst. The async copy is performed by all work-items in a work-group and this built-in function must therefore be encountered by all work-items in a workgroup executing the kernel with the same argument values; otherwise the results are undefined."
But that doesn't clarify my questions...
I would like to know, if the following assumptions are correct:
The call to async_work_group_copy() must be executed by all work-items in the group.
The call should be in a way, that the source address is identical for all work-items and points to the first element of the memory area to be copied.
My source address is based on the global work-item id of the first work-item in the work-group, so I have to subtract the local id to make the address identical for all work-items.
Is the third parameter really the number of elements (not the size in bytes)?
Bonus questions:
a. Can I just use barrier(CLK_LOCAL_MEM_FENCE) instead of wait_group_events() and ignore the return value? If so, would that be probably faster?
b. Does a local copy also make sense for processing on CPUs or is that overhead as they share a cache anyway?
Regards,
Stefan
One of the main reasons for this function existing is to allow the driver/kernel compiler to efficiently copy the memory without the developer having to make assumptions about the hardware.
You describe what memory you need copied as if it were a single-threaded copy, and async_work_group_copy gets it done for you using the parallel hardware.
For your specific questions:
I have never seen async_work_group_copy used by only some of the work items in a group. I always assumed this is because it is required. I think the blocking nature of wait_group_events forces all work items to be part of the copy.
Yes. Source (and destination) addresses need to be the same for all work items.
You could subtract your local id to get the correct address, but I find that basing the address on groupId solves this problem as well. (get_group_id)
Yes. The last param is the number of elements, not the size in bytes.
a. No. With a barrier instead of the event-based wait, you will find that the barrier is hit almost immediately by the work items, and the data won't necessarily have been copied yet. This makes sense because some OpenCL hardware might not even use the compute units at all to do the actual copy operation.
b. I think that cpu opencl implementations might guarantee L1 cache usage when you use local memory. The only way to know for sure if this performs better is to benchmark your application with various settings.
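To illustrate the address question, here is a small plain C sketch (not OpenCL; it only emulates the id arithmetic on the host). It assumes a 1-D NDRange with zero global offset and a global size that is a multiple of the group size. The function names are made up for the example. Under those assumptions, subtracting the local id from the global id gives the same starting offset as multiplying the group id by the group size, for every work-item in the group.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of the first element a work-group copies, computed as in the
   question: global id minus local id. */
size_t src_offset_from_ids(size_t global_id, size_t local_id)
{
    assert(global_id >= local_id);
    return global_id - local_id;
}

/* The same offset computed from the group id, as suggested in the answer. */
size_t src_offset_from_group(size_t group_id, size_t group_size)
{
    return group_id * group_size;
}

/* Returns 1 if every emulated work-item of every group computes the same
   offset both ways, 0 otherwise. */
int offsets_agree(size_t num_groups, size_t group_size)
{
    for (size_t g = 0; g < num_groups; ++g) {
        for (size_t l = 0; l < group_size; ++l) {
            /* value get_global_id(0) would have under the stated assumptions */
            size_t global_id = g * group_size + l;
            if (src_offset_from_ids(global_id, l) !=
                src_offset_from_group(g, group_size))
                return 0;
        }
    }
    return 1;
}
```

In a kernel, the equivalent source argument would be x + get_group_id(0) * GROUP_SIZE, which is identical for all work-items in the group, as the built-in requires.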

Memory test operation without pointers in NXC on NXT?

I'm trying to write a memory test program for the NXT, since I have several with burned memory cells and would like to identify which NXTs are unusable. This program is intended to test each byte in memory for integrity by:
Allocating 64 bits to a Linear Feedback Shift Register randomizer
Adding another byte to a memory pointer
Writing random data to the selected memory cell
Verifying the data is read back correctly
However, I then discovered through these attempts that the NXT doesn't actually support pointer operations. Thus, I can't simply iterate the pointer byte and read its location to test.
How do I go about iterating over indexes in memory without pointers?
I think the problem is that you don't really get direct memory access in either NBC/NXC or RobotC.
From what I know, both run on an NXT firmware emulator; so the bad memory address[es] might change from your program's point of view (assuming the emulator does virtual memory).
To actually run on bare metal, I would suggest using the NXTBINARY function of John Hansen's modified firmware as described here:
http://www.tau.ac.il/~stoledo/lego/nxt-native/
The enhanced firmware can be found at:
http://bricxcc.sourceforge.net/test_releases/
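For what it's worth, the write-then-verify loop from the question can be expressed with array indexes instead of pointer arithmetic. The sketch below is plain C, not NXC, and it only exercises whatever storage backs the array, so it does not get around the lack of direct memory access described above. The function names and the tap mask 0xD800000000000000 are assumptions chosen for the example, and the seed must be nonzero.

```c
#include <stddef.h>
#include <stdint.h>

/* One step of a 64-bit Galois LFSR. The tap mask is an assumption made
   for this sketch; a zero state would stay zero forever. */
uint64_t lfsr_next(uint64_t state)
{
    uint64_t lsb = state & 1u;
    state >>= 1;
    if (lsb)
        state ^= 0xD800000000000000ULL;
    return state;
}

/* Write one pseudo-random byte into each cell, addressing by index only. */
void fill_pattern(uint8_t cells[], size_t count, uint64_t seed)
{
    uint64_t state = seed;
    for (size_t i = 0; i < count; ++i) {
        state = lfsr_next(state);
        cells[i] = (uint8_t)(state & 0xFF);
    }
}

/* Replay the same sequence from the same seed and count cells whose
   contents no longer match what was written. */
size_t count_bad_cells(const uint8_t cells[], size_t count, uint64_t seed)
{
    uint64_t state = seed;
    size_t bad = 0;
    for (size_t i = 0; i < count; ++i) {
        state = lfsr_next(state);
        if (cells[i] != (uint8_t)(state & 0xFF))
            ++bad;
    }
    return bad;
}
```

Calling fill_pattern and then count_bad_cells with the same seed should report zero bad cells on healthy storage; any nonzero count points at cells that did not hold their value.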

OpenCL - Multiple GPU Buffer Synchronization

I have an OpenCL kernel that calculates total force on a particle exerted by other particles in the system, and then another one that integrates the particle position/velocity. I would like to parallelize these kernels across multiple GPUs, basically assigning some amount of particles to each GPU. However, I have to run this kernel multiple times, and the result from each GPU is used on every other. Let me explain that a little further:
Say you have particle 0 on GPU 0, and particle 1 on GPU 1. The force on particle 0 is changed, as is the force on particle 1, and then their positions and velocities are changed accordingly by the integrator. Then, these new positions need to be placed on each GPU (both GPUs need to know where both particle 0 and particle 1 are) and these new positions are used to calculate the forces on each particle in the next step, which is used by the integrator, whose results are used to calculate forces, etc, etc. Essentially, all the buffers need to contain the same information by the time the force calculations roll around.
So, the question is: What is the best way to synchronize buffers across GPUs, given that each GPU has a different buffer? They cannot have a single shared buffer if I want to keep parallelism, as per my last question (though, if there is a way to create a shared buffer and still keep multiple GPUs, I'm all for that). I suspect that copying the results each step will cause more slowdown than it's worth to parallelize the algorithm across GPUs.
I did find this thread, but the answer was not very definitive and applied only to a single buffer across all GPUs. I would like to know, specifically, for Nvidia GPUs (more specifically, the Tesla M2090).
EDIT: Actually, as per this thread on the Khronos forums, a representative from the OpenCL working group says that a single buffer on a shared context does indeed get spread across multiple GPUs, with each one making sure that it has the latest info in memory. However, I'm not seeing that behavior on Nvidia GPUs; when I use watch -n .5 nvidia-smi while my program is running in the background, I see one GPU's memory usage go up for a while, and then go down while another GPU's memory usage goes up. Is there anyone out there that can point me in the right direction with this? Maybe it's just their implementation?
It sounds like you are having implementation trouble.
There's a great presentation from SIGGRAPH that shows a few different ways to utilize multiple GPUs with shared memory. The slides are here.
I imagine that, in your current setup, you have a single context containing multiple devices with multiple command queues. This is probably the right way to go, for what you're doing.
Appendix A of the OpenCL 1.2 specification says that:
OpenCL memory objects, [...] are created using a context and can be shared across multiple command-queues created using the same context.
Further:
The application needs to implement appropriate synchronization across threads on the host processor to ensure that the changes to the state of a shared object [...] happen in the correct order [...] when multiple command-queues in multiple threads are making changes to the state of a shared object.
So it would seem to me that your kernel that calculates particle position and velocity needs to depend on your kernel that calculates the inter-particle forces. It sounds like you already know that.
To put things more in terms of your question:
What is the best way to synchronize buffers across GPUs, given that each GPU has a different buffer?
... I think the answer is "don't have the buffers be separate." Use the same cl_mem object between two devices by having that cl_mem object come from the same context.
As for where the data actually lives... as you pointed out, that's implementation-defined (at least as far as I can tell from the spec). You probably shouldn't worry about where the data is living, and just access the data from both command queues.
I realize this could create some serious performance concerns. Implementations will likely evolve and get better, so if you write your code according to the spec now, it'll probably run better in the future.
Another thing you could try in order to get a better (or at least different) buffer-sharing behavior would be to make the particle data a map.
If it's any help, our setup (a bunch of nodes with dual C2070s) seems to share buffers fairly optimally. Sometimes the data is kept on only one device; other times the data might exist in both places.
All in all, I think the answer here is to do it in the best way the spec provides and hope for the best in terms of implementation.
I hope I was helpful,
Ryan

Proper way to inform OpenCL kernels of many memory objects?

In my OpenCL program, I am going to end up with 60+ global memory buffers that each kernel is going to need to be able to access. What's the recommended way to let each kernel know the location of each of these buffers?
The buffers themselves are stable throughout the life of the application -- that is, we will allocate the buffers at application start, call multiple kernels, then only deallocate the buffers at application end. Their contents, however, may change as the kernels read/write from them.
In CUDA, the way I did this was to create 60+ program scope global variables in my CUDA code. I would then, on the host, write the address of the device buffers I allocated into these global variables. Then kernels would simply use these global variables to find the buffer it needed to work with.
What would be the best way to do this in OpenCL? It seems that CL's global variables are a bit different than CUDA's, but I can't find a clear answer on whether my CUDA method will work, and if so, how to go about transferring the buffer pointers into global variables. If that won't work, what's the best way otherwise?
60 global variables sure is a lot! Are you sure there isn't a way to refactor your algorithm a bit to use smaller data chunks? Remember, each kernel should be a minimum work unit, not something colossal!
However, there is one possible solution. Assuming your 60 arrays are of known size, you could store them all into one big buffer, and then use offsets to access various parts of that large array. Here's a very simple example with three arrays:
A is 100 elements
B is 200 elements
C is 100 elements
big_array = A[0:100] B[0:200] C[0:100]
offsets = [0, 100, 300]
Then, you only need to pass big_array and offsets to your kernel, and you can access each array. For example:
A[50] = big_array[offsets[0] + 50]
B[20] = big_array[offsets[1] + 20]
C[0] = big_array[offsets[2] + 0]
I'm not sure how this would affect caching on your particular device, but my initial guess is "not well." This kind of array access is a little nasty, as well. I'm not sure if it would be valid, but you could start each of your kernels with some code that extracts each offset and adds it to a copy of the original pointer.
On the host side, in order to keep your arrays more accessible, you can use clCreateSubBuffer: http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clCreateSubBuffer.html which would also allow you to pass references to specific arrays without the offsets array.
I don't think this solution will be any better than passing the 60 kernel arguments, but depending on your OpenCL implementation's clSetKernelArg, it might be faster. It will certainly reduce the length of your argument list.
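As a rough host-side illustration of the big-array-plus-offsets layout above, here is a minimal sketch in plain C (not kernel code). The names build_offsets and element_at are hypothetical helpers introduced for the example. With the lengths 100, 200 and 100 from the example, it produces the offsets 0, 100 and 300.

```c
#include <stddef.h>

/* Fill offsets[i] with the index where logical array i starts inside the
   packed buffer. Returns the total number of elements needed. */
size_t build_offsets(const size_t lens[], size_t n, size_t offsets[])
{
    size_t total = 0;
    for (size_t i = 0; i < n; ++i) {
        offsets[i] = total;
        total += lens[i];
    }
    return total;
}

/* Pointer to element `index` of logical array `which` in the packed buffer. */
float *element_at(float *big_array, const size_t offsets[],
                  size_t which, size_t index)
{
    return &big_array[offsets[which] + index];
}
```

Inside a kernel the same arithmetic applies: big_array[offsets[1] + 20] addresses B[20], and the offsets table could be passed as one extra buffer argument.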
You need to do two things. First, each kernel that uses each global memory buffer should declare an argument for each one, something like this:
kernel void awesome_parallel_stuff(global float* buf1, ..., global float* buf60)
so that each utilized buffer for that kernel is listed. And then, on the host side, you need to create each buffer and use clSetKernelArg to attach a given memory buffer to a given kernel argument before calling clEnqueueNDRangeKernel to get the party started.
Note that if the kernels keep using the same buffers with each kernel execution, you only need to set up the kernel arguments one time. A common mistake I see, which can bleed host-side performance, is to repeatedly call clSetKernelArg in situations where it is completely unnecessary.

Resources