I'm capturing video frames. Every frame is passed into the kernel as an Image2D. I've got about five simple image processing algorithms (blur, sharpen etc.) which a user can choose of (also a combination of different ones is possible). I see three possibilities here:
One kernel: At runtime construct the string of the kernel with the
chosen algorithms and compile (and take the overhead of the one-time
compiling delay)
One kernel: Handle the chosen algorithms with flags (although I understand that conditional branches are undesirable)
Many kernels (one per algorithm): Seems to be the issue, that an Image2D can either be read_only or write_only and I would need to repetively copy the image from and to the GPU as one output image of a kernel is the input image of the next kernel.
Is there a suggested rule of thumb which way to follow?
One workaround for the readonly/writeonly problem could be to use buffers for the middle steps.
Image2D -> buffer0 -> buffer1 -> ... bufferN -> Image2D
Or use two buffers and alternate with them if you don't need the intermediate results. (I2d, B0, B1, B0, ..., I2D)
You would probably need to know in advance how many filters you are applying, but that shouldn't be much of an issue.
I suggest that you try and avoid the first two options
This way you will have a very unreadable and complicated kernel code. which is OK if you are a 100% sure that the code does what it is supposed to do. That is with considerations on the arrangement and grouping of your work-items. The point is that it will be difficult and to debug and maintain your kernels this way.
This, I think, is even worse than the first option if your threads can go down in different paths. Simple branching can reduce your performances and create troubles for synchronizations. IN ADDITION to simple branching you should consider the fact that your algorithms might require different number and arrangement of threads. If that is the case then using one kernel for all operations is a really bad idea.
I haven't tried it myself but I think you should try out the option suggested by #mfa.
Related
To use OpenCL kernel the following is needed:
Put the kernel code in a string
call clCreateProgramWithSource
call clBuildProgram
call clCreateKernel
call clSetKernelArg (x number of arguments)
call clEnqueueNDRangeKernel
This need to be done for each kernel. Is there a way to do this repeating less code for each kernel?
There is no way to speed up the process. You need to go step by step as you listed.
But it is important to know why it is needed these steps, to understand how flexible the chain is.
clCreateProgramWithSource: Allows to add different strings from different sources to generate the program. Some string might be static, but some might be downloaded from a server, or loaded from disk. It allows the CL code to be dynamic and updated over time.
clBuildProgram: Builds the program for a given device. Maybe you have 8 devices, so you need to call this multiple times. Each device will produce a different binary code.
clCreateKernel: Creates a kernel. But a kernel is an entry point in a binary. So it is possible you create multiple kernels from a program (for different functions). Also the same kernel might be created multiple times, since it holds the arguments. This is useful for having ready-to-be-launched instances with proper parameters.
clSetKernelArg: Changes the parameters in the instance of the kernel. (it is stored there, so it can used multiple times in the future).
clEnqueueNDRangeKernel: Launches it, configuring the size of the launch and the chain of dependencies with other operations.
So, even if you could have a way to just call "getKernelFromString()", the functionality will be very limited, and not very flexible.
You can have look at wrapper libraries
https://streamhpc.com/knowledge/for-developers/opencl-wrappers/
I suggest you look into SYCL. The building steps are performed offline, saving execution time by skipping the clCreateProgramWithSource. The argument setting is done automatically by the runtime, extracting the information from the user lambda
There is also CLU: https://github.com/Computing-Language-Utility/CLU - see https://www.khronos.org/assets/uploads/developers/library/2012-siggraph-opencl-bof/OpenCL-CLU-and-Intel-SIGGRAPH_Aug12.pdf for more info. It is a very simple tool, but should make life a bit easier.
I am frustrated by this architecture since there is no obvious explanation why work groups should be 3 dimensional or I just haven't found the explanation yet. Since any number of dimensions can be emulated from one dimensional work groups it just seems like it adds extra complexity and makes it harder than it already is to understand the best way to divide your work into work groups.
For example, this person discovered that switching axis sped up his execution with a factor of two.
One hypothesis I have is that OpenCL wants a trivial relationship between the work item id and memory lookup to allow predictable memory operations that can be I/O optimized.
Work groups don't have to be three dimensional if your application/algorithm does not require it. You can specify 1, 2, or 3 dimensions -- and no doubt more in the future. So use fewer dimensions when is naturally suits your application.
So why would the specification allow for more dimensions? Like you pointed out, the higher dimensions can be emulated using a single dimension. One example would be a 3-dinensional N-Body simulation, for physics/molecular simulation.
One huge advantage of choosing to use 3D work groups is reducing the code complexity by a fair bit. Under the hood, the SDK you're running openCL on may be doing the emulation for you.
As for the 2x performance gain in your example: this boost was a result of a much better memory access pattern, rather than the hardware inherently being terrible at running on a 2D work group. The answer to that question explains ways to further optimize the kernel, which are great strategies for today's gpu hardware.
A more subtle benefit of using 3D work groups is that future hardware might not need to emulate the extra dimensions. Perhaps the memory, processor, etc would be tailored to 3D work groups, and reduce or eliminate the penalty for bad memory access patterns. If you write your code using 1D groups, you will miss out on a potential performance boost on these platforms. Even today it is possible to create FPGA/ASIC chips to handle 3D work groups better than GPUs.
What really tells you that only 3 dimensions are allowed?
clEnqueueNDRangeKernel() uses an unsigned integer to specify the number of dimensions, and uses an array of unsigned integers for each dimension size.
The OpenCL spec states that the maximum number of dimension is implementation defined as the constant CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS, which is in practice often 3, but could be anything. It's just a matter of convenience, as most computational problems operate on "real world" data, which has between 1 to 3 dimensions.
Also, nobody forces you to use 3. Most applications use 1 and 2, and work perfectly fine.
If you are thinking why N and not always 1, you will understand it when you have to use local memory. It is terribly easier to use local memory in an image when the work group is in 2D, since the work items cover a small rectangular zone of the image, instead of a line of it.
You can emulate it with clever index conversions, but using it as the API is designed, it is much easier and more readable.
I wonder what is the difference in terms of running time between executing the MPI_Alltoallv and MPI_Alltoall functions when the amount of transferred data is approximately the same? I couldn't find any such benchmark results. I am interested in large-scale instances, where tens of thousands or better hundreds of thousand of MPI processes are used and where these processes correspond to a substantial part of a given HPC system (considering at best some modern ones, such as BG/Q, Cray XC30, Cray XE6, ...).
Overview
One of the big advantages of MPI_Alltoall is that protocol decisions can be made quickly because they depend on a handful of scalars. In contrast, if a library implementer wants to optimize MPI_Alltoallv, they have to scan four vectors to determine if, for example, the communication is nearly homogeneous, highly sparse, or some other pattern.
The other issue is that MPI_Alltoall can easily use the output buffer as scratch space because every process provides and consumes the same amount of data. For MPI_Alltoallv, it's not practical to do all the bookkeeping, so any scratch space is going to be allocated. I can't remember the specifics of this issue, but I think I've read it somewhere in the MPI canon.
Implementation Skeletons
There are at least two special cases of alltoallv for which one can optimize better than the MPI library can:
Nearly homogeneous communication, i.e. the count vectors are nearly constant. This can happen when you have a distributed array that doesn't divide evenly across the process grid. In this case, you can:
Pad your arrays and use MPI_Alltoall directly.
Use MPI_Alltoall for the subset of processes that have homogeneous communication and either MPI_Alltoallv or a batch of Send-Recv for the remainder. This works best if you can cache the associated communicators. Using nonblocking communication should help too.
Write your own implementation of Bruck that handles the cases where the count varies, which is likely at the end of your vector. Having not done this myself, I don't know how difficult or worthwhile this one is.
Sparse communication, i.e. the count vector contains a large number of zeros. For this case, just use a batch of nonblocking Send-Recv and Waitall, because that's likely the best the MPI library will ever do and doing it yourself allows you to tune the batch size if you want.
Papers
MPI on a Million Processors describes the scalabillity issue associated with vector collectives. Granted, you may not see the cost of scanning the vector arguments on most CPUs, but it is an O(n) problem that motivates implementers to not touch the vector arguments more than necessary.
HykSort: a new variant of hypercube quicksort on distributed memory architectures describes a custom implementation that performs much better than optimized libraries. Such an optimization is rather difficult to implement inside of an MPI library, because it may be rather specialized. (This reference is targeted at Hristo's comment, not your question, by the way.)
Code
You can discover some interesting things by comparing the implementations of these operations in MPICH (https://github.com/pmodels/mpich/blob/main/src/mpi/coll/alltoall.c and https://github.com/pmodels/mpich/blob/main/src/mpi/coll/alltoallv.c). Only MPI_Alltoall uses Bruck's algorithm and pairwise exchange. Similar conclusions can be drawn from the available options for I_MPI_ADJUST_ALLTOALL and I_MPI_ADJUST_ALLTOALLV on https://software.intel.com/en-us/node/528906. Whether these limitations are fundamental or merely practical is left as an exercise for the reader.
Practical Experience
When MPI_Alltoall on Blue Gene/P used DCMF_Alltoallv (source code), so there was no difference relative to MPI_Alltoallv, and the latter might have even been better since the application pre-populated the vector arguments.
I wrote a version of all-to-all exchange for Blue Gene/Q that was as fast as MPI_Alltoall. My version was agnostic to constant versus vector arguments so this result implies that MPI_Alltoallv would perform similarly to MPI_Alltoall. However, I can't find the code now to be absolutely sure of the details.
However, Blue Gene networks were rather special, particularly w.r.t. all-to-all, so the behavior on fat-tree or dragonly networks on systems where the CPU is much faster than the network will be quite different.
I suggest you write a benchmark and measure it where you intend to run your application. Once you have some data, it will be much easier to figure out what optimizations may be missed.
I have done a MPI and GPU version of diffusion equation.
In MPI version, I compute next values by doing a decomposition of the grid and each process represents a sub-grid.
In GPU/OpenCL version, I compute next values by converting 2D grid to 1D and looping of the global index of this 1D grid to achieve the update of all grid.
Now, I would like to know if it is possible to mix these both versions, i.e to assign a sub-grid for each MPI process and into the sub-grid, compute the values with GPU/OpenCL.
I think that it's only feasible if GPU is able to share its ressources between different MPI processes (I have only a GPU device)
Anyone could tell me if actually this is possible ?
thanks
Sure, the GPU can be shared between multiple processes. It's still just one resource so if you had it reasonably well utilized before with one process then don't expect much scaling since now your processes are competing for a single resource. Worst case is performance actually gets worse, if you oversubscribe the GPU. Another issue to watch out for is GPU memory usage.
I apologize in advance for the vagueness of this question.
Background:
I am attempting to write a morphological image processing function in OpenCL. I have a __local buffer which I use to store data for every pixel (each pixel is represented by a work-item, no loop unrolling yet). Also, since I am early in testing, I am only using a single work-group (8x8 pixel image so I can manually validate results).
Problem:
There are occasions when data from one, two, three, or even four pixels must be added into the pixel buffer of another. Since these are adjacent pixel in the same workgroup, I am sure I am causing local memory bank conflicts. That's ok, speed isn't my top priority (yet!). However, these bank conflicts seem to be dropping data and even corrupting data. I've been very careful not to overflow or over run the buffers.
So, my first question is: is it, in fact, possible that the the bank conflicts are causing data corruption and loss? The Opencl spec seems to indicate that the operation should serialize, slowing down the bandwidth - but there is no mention of data loss.
My second question is: Help! - What can I do about this?
Any guidance will be greatly appreciated - thanks!
maybe the nvidia whitepaper Prefix Sum (Scan) with CUDA can bring you on the right track. It is about the all-prefix-sums algorithm, which is a good example of a computation that seems inherently sequential, but for which there is an efficient parallel algorithm.
The all-prefix-sums operation turns lists of numbers [3,4,1,2] into their sums: [0,3,7,8].
I know the paper is about CUDA, but I found that the resulting kernels are very similar as
both tchnologies use similar concepts.
I hope, the paper can help you.
Cheers