I have an OpenCL kernel that calculates total force on a particle exerted by other particles in the system, and then another one that integrates the particle position/velocity. I would like to parallelize these kernels across multiple GPUs, basically assigning some amount of particles to each GPU. However, I have to run this kernel multiple times, and the result from each GPU is used on every other. Let me explain that a little further:
Say you have particle 0 on GPU 0, and particle 1 on GPU 1. The force on particle 0 is changed, as is the force on particle 1, and then their positions and velocities are changed accordingly by the integrator. Then, these new positions need to be placed on each GPU (both GPUs need to know where both particle 0 and particle 1 are) and these new positions are used to calculate the forces on each particle in the next step, which is used by the integrator, whose results are used to calculate forces, etc, etc. Essentially, all the buffers need to contain the same information by the time the force calculations roll around.
So, the question is: What is the best way to synchronize buffers across GPUs, given that each GPU has a different buffer? They cannot have a single shared buffer if I want to keep parallelism, as per my last question (though, if there is a way to create a shared buffer and still keep multiple GPUs, I'm all for that). I suspect that copying the results each step will cause more slowdown than it's worth to parallelize the algorithm across GPUs.
I did find this thread, but the answer was not very definitive and applied only to a single buffer across all GPUs. I would like to know specifically about Nvidia GPUs (more specifically, the Tesla M2090).
EDIT: Actually, as per this thread on the Khronos forums, a representative from the OpenCL working group says that a single buffer on a shared context does indeed get spread across multiple GPUs, with each one making sure that it has the latest info in memory. However, I'm not seeing that behavior on Nvidia GPUs; when I use watch -n .5 nvidia-smi while my program is running in the background, I see one GPU's memory usage go up for a while, and then go down while another GPU's memory usage goes up. Is there anyone out there that can point me in the right direction with this? Maybe it's just their implementation?
It sounds like you are having implementation trouble.
There's a great presentation from SIGGRAPH that shows a few different ways to utilize multiple GPUs with shared memory. The slides are here.
I imagine that, in your current setup, you have a single context containing multiple devices with multiple command queues. This is probably the right way to go, for what you're doing.
Appendix A of the OpenCL 1.2 specification says that:
OpenCL memory objects, [...] are created using a context and can be shared across multiple command-queues created using the same context.
Further:
The application needs to implement appropriate synchronization across threads on the host processor to ensure that the changes to the state of a shared object [...] happen in the correct order [...] when multiple command-queues in multiple threads are making changes to the state of a shared object.
So it would seem to me that your kernel that calculates particle position and velocity needs to depend on your kernel that calculates the inter-particle forces. It sounds like you already know that.
To put things more in terms of your question:
What is the best way to synchronize buffers across GPUs, given that each GPU has a different buffer?
... I think the answer is "don't have the buffers be separate." Use the same cl_mem object between two devices by having that cl_mem object come from the same context.
As for where the data actually lives... as you pointed out, that's implementation-defined (at least as far as I can tell from the spec). You probably shouldn't worry about where the data is living, and just access the data from both command queues.
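Here is a minimal sketch of that idea (ctx, dev, forceKernel, and positions are placeholder names, not from your code): one context spanning both GPUs, one cl_mem holding all particle positions, and one command queue per device, each running the force kernel on its own half of the particles.

    /* Sketch: one shared cl_mem, two command queues in the same context.
     * The runtime is left to keep the buffer coherent between the devices. */
    #include <CL/cl.h>

    void force_step(cl_context ctx, cl_device_id dev[2], cl_kernel forceKernel,
                    cl_mem positions, size_t nParticles)
    {
        cl_command_queue q[2];
        cl_event done[2];
        size_t half = nParticles / 2;      /* assumes an even particle count */

        for (int i = 0; i < 2; ++i)
            q[i] = clCreateCommandQueue(ctx, dev[i], 0, NULL);

        clSetKernelArg(forceKernel, 0, sizeof(cl_mem), &positions);

        for (int i = 0; i < 2; ++i) {
            size_t offset = i * half;
            /* each device works on its own slice of the same buffer object */
            clEnqueueNDRangeKernel(q[i], forceKernel, 1, &offset, &half,
                                   NULL, 0, NULL, &done[i]);
        }

        clWaitForEvents(2, done);          /* host-side sync before integrating */

        for (int i = 0; i < 2; ++i) {
            clReleaseEvent(done[i]);
            clReleaseCommandQueue(q[i]);
        }
    }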
I realize this could create some serious performance concerns. Implementations will likely evolve and get better, so if you write your code according to the spec now, it'll probably run better in the future.
Another thing you could try in order to get a better (or at least different) buffer-sharing behavior would be to make the particle data a mapped buffer instead of explicitly reading and writing it.
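For example (a sketch; queue, positions, and nbytes are placeholder names), mapping and unmapping the buffer lets the runtime decide how to move the data instead of you issuing explicit reads and writes:

    /* Map the particle buffer into host memory, touch it, then unmap it. */
    cl_int err;
    float *view = (float *)clEnqueueMapBuffer(queue, positions, CL_TRUE,
                                              CL_MAP_READ | CL_MAP_WRITE,
                                              0, nbytes, 0, NULL, NULL, &err);
    /* ... inspect or update the particle data on the host ... */
    clEnqueueUnmapMemObject(queue, positions, view, 0, NULL, NULL);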
If it's any help, our setup (a bunch of nodes with dual C2070s) seems to share buffers fairly optimally. Sometimes the data is kept on only one device; other times it exists in both places.
All in all, I think the answer here is to do it in the best way the spec provides and hope for the best in terms of implementation.
I hope I was helpful,
Ryan
Related
That might be a noob question, but I want to use OpenCL to take advantage of the dozens of GPU cores. A couple of days ago, when I started searching about programming with OpenCL, I got confused with work groups, work items, kernels and the logic of OpenCL. Before I proceed dealing with this stuff, here is my question:
Can I just assign a thread with code to run on a single GPU compute core (or a specified core), just like when you program a multi-core CPU?
No, that's not how it works. In OpenCL you write a kernel that executes a single work item of work. It might be as simple as a memory copy, or it could read pixels from source images, mix them together, and write a pixel to an output image. This kernel gets executed across the whole global work size (e.g., the whole output image). The runtime makes that happen. It's not like multithreaded CPU code where each thread does different stuff. It's more like having a warehouse full of 1000 interns. Each has a unique employee number, and the stuff in the warehouse has numbers, so you can say things like "look in boxes (your number) and (your number plus 1000), put the pieces you find inside together, and put the new part in box (your number plus 2000)". You say that once over the megaphone, and 1000 parts get built in parallel.
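To make the analogy concrete, a toy kernel could look like this (buffer names are made up for illustration): each work item reads its own element from two input arrays and writes one result.

    /* Every work item has a unique global id -- the intern's employee number. */
    __kernel void assemble(__global const float *partsA,
                           __global const float *partsB,
                           __global float *assembled)
    {
        size_t i = get_global_id(0);           /* "your number" */
        assembled[i] = partsA[i] + partsB[i];  /* put the two pieces together */
    }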
I'm new to MPI and I'm trying to understand how MPI (and specifically Open MPI) works in order to reason about the performance of my system.
I've tried to find resources online to help me understand things a little better, but haven't had much luck. I thought I'd come here.
Right now my question is simple: if I have 3 nodes (1 master, 2 clients) and I issue an MPI_Gather, does the root process handle incoming data sequentially or concurrently? In other words, if process 1 is the first to make a connection with process 0, will process 2 have to wait until process 1 is done sending its data before it can start to send its data?
Thanks!
There are multiple components in Open MPI that implement collective operations and some of them provide multiple algorithms for the implementation of each operation.
What you are most likely interested in is the tuned component of the coll framework as that is what Open MPI uses by default. tuned implements all collectives using point-to-point operations and provides several algorithms for gather:
linear with synchronisation - used when messages are mid-sized to large
binomial - used when the number of processes is large or the message size is small
basic linear - used in all other cases
The performance of each algorithm depends strongly on the particular combination of message size and number of ranks, therefore the library comes with a set of heuristics that tries to determine the best algorithm based on the data size and the size of the communicator (as indicated above). There are several mechanisms to override the heuristics and either force a certain algorithm or provide a list of custom algorithm selection rules.
The basic linear algorithm simply has the root loop over all other ranks receiving their messages in sequence. In that case, rank 2 won't be able to send its chunk before rank 1 since the root will first receive the message from rank 1 and only then move on to rank 2.
The linear with synchronisation algorithm splits the chunks into two pieces each. The first pieces are collected in sequence just like in the basic linear algorithm. The second pieces are collected asynchronously using non-blocking receives.
The binomial algorithm arranges the ranks as a binomial tree. The processes at the nodes of the tree receive the chunks from the lower levels and aggregate them into larger chunks that then get passed to the upper levels until they reach the root rank.
You can find the source code of the tuned module in the ompi/mca/coll/tuned folder of the Open MPI source tree. In the development branch, part of the tuned component got promoted to the base implementation of the collective framework and the code for the gather is to be found in ompi/mca/coll/base instead.
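For reference, a minimal MPI_Gather in C looks like this (the chunk size of 4 is arbitrary); which of the algorithms above actually moves the data is decided inside the library:

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int chunk = 4;                 /* elements contributed per rank */
        int sendbuf[4];
        for (int i = 0; i < chunk; ++i)
            sendbuf[i] = rank * chunk + i;

        int *recvbuf = NULL;
        if (rank == 0)                       /* only the root needs the full buffer */
            recvbuf = malloc(sizeof(int) * chunk * size);

        /* The root ends up with all chunks ordered by rank; the gathering
         * pattern (basic linear, binomial, ...) is chosen by the library. */
        MPI_Gather(sendbuf, chunk, MPI_INT, recvbuf, chunk, MPI_INT,
                   0, MPI_COMM_WORLD);

        free(recvbuf);
        MPI_Finalize();
        return 0;
    }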
Hristo's answer is of course excellent, but I would like to offer a different point of view.
Contrary to your expectation, the question is not simple. It isn't even possible to specifically answer it without knowing more system specifics, as Hristo pointed out. That doesn't mean the question is invalid, but you should start to reason about performance on a different level.
First, consider the complexity of the gather operation: the total network transfer to the root, as well as the memory requirements, grow linearly with the number of processes in the communicator. This naturally limits scalability.
Second, you may assume that your MPI implementation does implement MPI_Gather in the most efficient way possible - better than you could do it by hand. This assumption may very well be wrong, but it is the best starting point to write your program.
Now when you have your program, you should measure and see where time is spent - or wasted. For that you should use an MPI performance analysis tool. If you have identified that your gather has a significant impact on performance, you can go ahead and try to optimize it: but before doing so, first consider whether you can structure your communication conceptually better, e.g. by removing the communication altogether or by using a clever reduction instead. If you still need to stick to the gather, go ahead and tune your MPI implementation. Afterwards, verify that your optimization did indeed improve performance on your specific system.
My classmates and I are being confronted with OpenCL for the first time. As expected, we ran into some issues. Below I have summarized the issues we had and the answers we found. However, we're not sure that we got it all right, so it would be great if you could take a look at both our answers and the questions below them.
Why didn't we split that up into single questions?
They partly relate to each other.
We think these are typical beginner's questions. Those fellow students who we consulted all replied "Well, that I didn't understand either."
Work items vs. Processing elements
In most of the lectures on OpenCL that I have seen, the same illustration is used to introduce compute units and processing elements as well as work groups and work items. This has led my classmates and me to continuously confuse these concepts. Therefore we came up with a definition that emphasizes the fact that processing elements are very different from work items:
A work item is a kernel that is being executed, whereas a processing element is an abstract model that represents something that actually does computations. A work item is something that exists only temporarily in software, while a processing element abstracts something that physically exists in hardware. However, depending on the hardware and therefore depending on the OpenCL implementation, a work item might be mapped to and executed by some piece of hardware that is represented by a so-called processing element.
Question 1: Is this correct? Is there a better way to express this?
NDRange
This is how we perceive the concept of NDRange:
The number of work items that exist is represented by the NDRange size. Commonly, this is also referred to as the global size. However, the NDRange can be either one-, two-, or three-dimensional ("ND"):
A one-dimensional problem would be some computation on a linear vector. If the vector's size is 64 and there are 64 work items to process that vector, then the NDRange size equals 64.
A two-dimensional problem would be some computation on an image. In the case of a 1024x768 image, the NDRange size Gx would be 1024 and the NDRange size Gy would be 768. This assumes that there are 1024x768 work items out there to process each pixel of that image. The NDRange size then equals 1024x768.
A three-dimensional example would be some computation on a 3D model or so. Additionally, there is NDRange size Gz.
Question 2: Once again, is this correct?
Question 3: These dimensions are simply there for convenience, right? One could simply store the color values of each pixel of an image in a linear vector of size width * height. The same is true for any 3D problem.
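To make the question concrete, here is a toy example we have in mind (made-up kernels, not from the lectures): the same image can be addressed through a 2D NDRange or a linearized 1D NDRange.

    /* 2D NDRange: (Gx, Gy) = (width, height) */
    __kernel void touch_2d(__global float *img, int width)
    {
        size_t x = get_global_id(0);
        size_t y = get_global_id(1);
        img[y * width + x] += 1.0f;
    }

    /* 1D NDRange: Gx = width * height */
    __kernel void touch_1d(__global float *img)
    {
        size_t i = get_global_id(0);
        img[i] += 1.0f;
    }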
Various
Question 4: We were told that the execution of kernels (in other words: work items) can be synchronized within a work group using barrier(CLK_LOCAL_MEM_FENCE). Understood. We were also (repeatedly) told that work groups cannot be synchronized. Alright. But then what's the use of barrier(CLK_GLOBAL_MEM_FENCE)?
Question 5: In our host program, we specify a context that consists of one or more device(s) from one of the available platforms. However, we can only enqueue kernels in a so-called command queue that is linked to exactly one device (that has to be in the context). Again: The command queue is not linked to the previously defined context, but to a single device. Right?
Question 1: Almost correct. A work-item is an instance of a kernel (see paragraph 2 of section 3.2 of the standard). See also the definition of processing element from the standard:
Processing Element: A virtual scalar processor. A work-item may execute on one or more processing elements.
See also the answer I provided to that question.
Question 2 & 3: Whether you use more than one dimension, or exactly as many work-items as you have data elements to process, depends on your problem; it's up to you and whatever makes development easier. Note also that with OpenCL 1.2 and below you have a constraint which forces the global size to be a multiple of the work-group size (removed in OpenCL 2.0).
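A common host-side pattern under that pre-2.0 constraint (a sketch; n, local, kernel and queue are placeholder names) is to round the global size up and let the extra work-items return early:

    /* Pad the global size up to a multiple of the work-group size;
     * the kernel should start with: if (get_global_id(0) >= n) return; */
    cl_uint n      = 1000;                            /* real number of elements */
    size_t  local  = 64;                              /* chosen work-group size  */
    size_t  global = ((n + local - 1) / local) * local;

    clSetKernelArg(kernel, 1, sizeof(cl_uint), &n);   /* pass the real count */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);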
Question 4: Yes, synchronization during the execution of a kernel is only possible within a work-group, thanks to barriers. The difference between the flags you pass as a parameter refers to the type of memory. With CLK_LOCAL_MEM_FENCE all work-items have to make sure that the data they wrote to local memory is visible to the others; with CLK_GLOBAL_MEM_FENCE it's the same, but for global memory.
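To make the local case concrete, here is a sketch of a typical pattern (names are illustrative): each work-group stages data in local memory and uses the barrier before one work-item consumes it.

    /* Each work-group stages its elements in local memory, hits the barrier so
     * every write is visible, then work-item 0 writes the group's partial sum. */
    __kernel void group_sum(__global const float *in,
                            __global float *partial,
                            __local  float *scratch)
    {
        size_t lid = get_local_id(0);
        size_t lsz = get_local_size(0);

        scratch[lid] = in[get_global_id(0)];
        barrier(CLK_LOCAL_MEM_FENCE);          /* sync within the work-group only */

        if (lid == 0) {
            float sum = 0.0f;
            for (size_t i = 0; i < lsz; ++i)
                sum += scratch[i];
            partial[get_group_id(0)] = sum;
        }
    }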
Question 5: Within a context you can have several devices, each with several command queues. As you stated, a command-queue is linked to one device, but you can enqueue your kernels in different command-queues belonging to different devices. Note that if two command-queues try to access the same memory object (without synchronization) you get undefined behavior. You'd typically use two or more command queues when their respective jobs are not related.
However, you can synchronize command-queues through events, and as a matter of fact you can also create your own events (called user events); see section 5.9 for events and section 5.10 for user events (of the standard).
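For example (a sketch; queueA, queueB, kernelA, kernelB and global are placeholder names), an event produced by one command-queue can gate a kernel on another:

    /* kernelB on queueB must not start before kernelA on queueA has finished. */
    cl_event a_done;
    clEnqueueNDRangeKernel(queueA, kernelA, 1, NULL, &global, NULL, 0, NULL, &a_done);
    clEnqueueNDRangeKernel(queueB, kernelB, 1, NULL, &global, NULL, 1, &a_done, NULL);
    clReleaseEvent(a_done);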
I'd advise you to read at least the first chapters (1 to 5) of the standard. If you're in a hurry, at least chapter 2, which is actually the glossary.
I'm fairly new to OpenCL and GPGPU programming and would like to clarify something:
Do work-groups interleave on an SM of an Nvidia card, the way warps within a work-group do?
Or are they always serialized, meaning one work-group has to retire before the next one comes in?
thanks
Eugene
You are taking the wrong approach. You simply can't know how they are going to be scheduled.
In fact this is a KEY element of the parallel approach: you can run millions of threads with little need for synchronization between them. If you had to know how they are scheduled in order to sync them, it would be hell.
Additionally, a given device does not always run the work-groups in the same order; the order differs with each launch. The number of work-groups running in parallel varies as well, so it can be groups of 4, then groups of 5 (for example).
Take this into account when designing: each work-item should be completely detached and work on its own.
Suppose that I have two big functions. Is it better to write them as separate kernels and call them sequentially, or is it better to write only one kernel? (I don't want to read the data back and forth between host and device in between.) What about the speed-up if I want to call the kernel many times?
One thing to consider is the effect of register pressure on hardware utilization and performance.
As a general rule, big kernels have big register footprints. Typical OpenCL devices (i.e., GPUs) have very finite register file sizes, and large kernels can result in lower concurrency (fewer concurrent warps/wavefronts), fewer opportunities for latency hiding, and poorer overall performance. On the other hand, kernel launch overheads are pretty low on most platforms, so if your algorithm doesn't have an enormous amount of state to save between "phases" of execution, the penalty of using multiple kernels can be rather low.
Using multiple kernels also has another side benefit -- you get implicit synchronization between all work units for free. Often that can eliminate the need for atomic memory operations and synchronization primitives which can have a negative impact on code performance.
The ultimate guide should be measured performance. There is no universal rule of thumb for this sort of thing; benchmarking is the only way to know for sure.
In general this is a question of (maybe) slightly better performance vs. readability of your code. Copying buffers is no issue as long as you keep them within the same context. E.g. you could set one output buffer of a kernel to be an input buffer of the next kernel, which would not involve any copying.
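For instance (a sketch with placeholder kernel and buffer names), the intermediate data never leaves the device:

    /* stage1 writes into 'intermediate'; stage2 reads it; no host copy in between. */
    clSetKernelArg(stage1, 0, sizeof(cl_mem), &input);
    clSetKernelArg(stage1, 1, sizeof(cl_mem), &intermediate);
    clSetKernelArg(stage2, 0, sizeof(cl_mem), &intermediate);
    clSetKernelArg(stage2, 1, sizeof(cl_mem), &output);

    clEnqueueNDRangeKernel(queue, stage1, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueNDRangeKernel(queue, stage2, 1, NULL, &global, NULL, 0, NULL, NULL);
    /* an in-order queue guarantees stage2 sees stage1's results */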
The proper way to code in OpenCL is to separate your code into parallel tasks, each of which is a kernel. That is, each "for loop" should be a kernel. Sometimes a single CPU function can result in a 4-kernel implementation in OpenCL.
If you need to store data between kernel executions just use OpenCL buffers and do not copy to host (this solves the DEVICE<->HOST bottleneck).
If both functions act on different data, you could probably write a single kernel, but that depends on the complexity of the operation being run.