CUDA recursion depth

CUDA recursion depth - recursion

When using Dynamic Parallelism in CUDA, you can implement recursive algorithms like mergeSort. I have implemented it and my program don't work for inputs greater than blah.
My question is how many depth in the recursion tree the implementation can go? Is there any limitation? (My program is just fine for smaller inputs.)

From Professional CUDA C Programming:
The maximum nesting depth of dynamic parallelism is limited to 24, but in reality most kernels will be limited by the amount of memory required by the device runtime system at each new level . . .

Related

Limit GPU Memory in Julia using CuArrays

I'm fairly new to julia and I'm currently trying out some deep convolution networks with recurrent structures. I'm training the networks on a GPU using
CuArrays(CUDA Version 9.0).
Having two separate GPU's, I started two instances with different datasets.
Soon after some training both julia instances allocated all available Memory (2 x 11GB) and I couldn't even start another instance on my own using CuArrays (Memory allocation error). This became quite a problem, since this is running on a Server which is shared among many people.
I'm assuming that this is a normal behavior to use all available memory to train as fast as possible. But, under these circumstances I would like to limit the memory which can be allocated to run two instances at the same time and don't block me or other people from using the GPU.
To my surprise I found only very, very little information about this.
I'm aware of the CUDA_VISIBLE_DEVICES Option but this does not help since I want to train simultaneously on both devices.
Another one suggested to call GC.gc() and CuArrays.clearpool()
The second call throws an unknown function error and seems not to be within the CuArray Package anymore. The first one I'm currently testing but not exactly what I need. Is there any possibilty to limit the allocation of RAM on a GPU using CuArrays and Julia?
Thanks in advance
My Batchsize is 100 and one batch should have less than 1MB...

There is currently no such functionality. I quickly whipped something up, see https://github.com/JuliaGPU/CuArrays.jl/pull/379, you can use it to define CUARRAYS_MEMORY_LIMIT and set it to an amount of bytes that the allocator will not go beyond. Note that this might significantly increase memory pressure, a situation for which the CuArrays.jl memory allocator is currently not optimized (though it is one of my top priorities for the Julia GPU infrastructure).

OpenCL dead lock possibility

I'm using global atomics to synchronize between work groups in OpenCL.
So the kernel uses a code like
... global volatile uint* counter;
if(get_local_id(0) == 0) {
while(*counter != expected_value);
}
barrier(0);
To wait until counter becomes expected_value.
And at another place it does
if(get_local_id(0) == 0) atomic_inc(counter);
Theoretically the algorithm is such that this should always work, if all work groups are running concurrencly. But if one work group starts only after another has completely finished, then the kernel can deadlock.
On CPU and on GPU (NVidia CUDA platform), it seems to always work, with a large number of work groups (over 8000).
For the algorithm this seems to be the most efficient implementation. (It does a prefix sums over each line in a 2D buffer.)
Does OpenCL and/or NVidia's OpenCL implementation guarantee that this always works?

Does OpenCL and/or NVidia's OpenCL implementation guarantee that this
always works?
As far as the OpenCL standard is concerned, this is not guaranteed (similarly for CUDA). Now, in practice, it may very well work due to your specific OpenCL implementation, but bear in mind that it's not guaranteed by the standard, so make sure you understand your implementation's execution model to ensure this is safe, and that such code won't necessarily be portable across other conforming implementations.
Theoretically the algorithm is such that this should always work, if all work groups are running concurrencly
OpenCL states that work groups can run in any order, and not necessarily in parallel nor even concurrently. CUDA has similar wording, although CUDA 9 does support a form of grid-wise synchronization.
OpenCL spec, 3.2.2 Execution Model: Execution of kernel-instances:
A conforming implementation may choose to serialize the work-groups so a correct algorithm cannot assume that work-groups will execute in parallel. There is no safe and portable way to synchronize across the independent execution of work-groups since once in the work-pool, they can execute in any order.

GPU Brute-Force Implementation

I am asking for an advise for the following problem:
For a research-project I am writing a brute-force algorithm based on a GPU with (py)OpenCl.
(I know JTR is out there)
Right now I do have a Brute-Force-Generator in Python which is filling up for each round the buffer with words (amount=1024*64).I pass the buffer to the GPU Kernel. The GPU is calculating for each value in the buffer a MD5 Hash Value and compares it with a given one. Great that it works!
BUT:
I don't think this is really the full performance i can get from the GPU - or is it? Isn't there a bottleneck when i have to fill up the buffer by the CPU and pass it to the GPU 'just' for a Hash calculation an comparison - or am i wrong and this is already the fastet or almost the fastet performance i can get?
I have done a lot of Research before I consider to ask this question here. I couldn't find a brute force implementation on the GPU kernel so far - why?
Thx
EDIT 1:
I try to explain it in a different way what I want to know. Lets say I have an average computer. Performing an brute-force-algorithm on a GPU is faster than on a CPU (if you do it right). I have looked through some GPU brute-force tools and couldn't find one with the whole brute-force implementation on the GPU Kernel.
Right now I am passing "word packages" to the GPU and let them do the work (hash & compare) there - looks like this is the common way . Isn't it faster to 'split' the brute-force algorithm so each Unit on the GPU will generate its own "word packages" by itself.
All I do is wondering why the common way is to pass packages with values from the CPU to the GPU instead of doing the CPU work also on the GPU work! Is it because it is not possible to split a brute-force algorithm on a GPU or isn't the improvement worth the effort to port it to the GPU?

About the performance of the "brute-force" approach.
All i do is wondering why the common way is to pass packages with values from the CPU to the GPU instead of doing the CPU work also on the GPU work! Is it because it is not possible to split a brute-force algorithm on a GPU or isn't the improvement worth the effort to port it to the GPU?
I do not know the details of your algorithm, but, in general, there are some points to consider before creating a hybrid CPU-GPU algorithm. Just to name a few:
Different architectures (best CPU algorithm probably is not the best
GPU algorithm).
Extra synchronization points.
Different memory spaces (implies PCIe/network transfers).
More complex algorithms
More complex fine tuning.
Vendors policy.
Nevertheless, there are quite a few examples out there that combines the power of the GPU and the CPU at the same time. Typically, sequential or highly divergent parts of the algorithm will run on the CPU while the homogeneous, computing intensive part runs on the GPU. Other applications, uses the CPU to preprocess the input data to a format that is more amenable to GPU processing (for instance, changing the data layout). Finally, there are applications targeting pure performance that really do a significant amount of work on the CPU, like the MAGMA project.
In summary, the answer it that it really depends on the details of your algorithm if it is really possible or if it worth it to design a hybrid algorithm that takes the most of your CPU-GPU system as a whole.
About the performance of your current approach
I think you should break down your question in two parts:
It is my GPU kernel efficient?
How much time am I actually doing work at the GPU?
Regarding the first one, you didn't provide any information about your GPU kernel so we could not really help you with it, but general optimization approaches apply:
Is it your computation memory/compute bound?
How far are you from your GPU peak memory bandwidth?
You need to start from these question in order to known what kind of optimization/algorithm you should apply. Take a look at the roofline performance model.
As for the second question, even though you don't go into detail, it seems like your application spend so much time on small memory transfers (take a look at this article about how to optimize memory transfers). The overhead of starting the PCIe just to send a few words would kill any performance benefit you get from using a GPU device. Thus, sending a bunch of small buffers instead of large chunks of memory packing a large number of them is not, in general, the way to go.
If you're looking for performance, you may want to overlap the computation and the memory transfer. Read this article for more information.
As a general recommendation, before implementing any optimization, take some time to profile your application. It would save you a lot of time.

MPI vs openMP for a shared memory

Lets say there is a computer with 4 CPUs each having 2 cores, so totally 8 cores. With my limited understanding I think that all processors share same memory in this case. Now, is it better to directly use openMP or to use MPI to make it general so that the code could work on both distributed and shared settings. Also, if I use MPI for a shared setting would performance decrease compared with openMP?

Whether you need or want MPI or OpenMP (or both) heavily depends the type of application you are running, and whether your problem is mostly memory-bound or CPU-bound (or both). Furthermore, it depends on the type of hardware you are running on. A few examples:
Example 1
You need parallelization because you are running out of memory, e.g. you have a simulation and the problem size is so large that your data does not fit into the memory of a single node anymore. However, the operations you perform on the data are rather fast, so you do not need more computational power.
In this case you probably want to use MPI and start one MPI process on each node, thereby making maximum use of the available memory while limiting communication to the bare minimum.
Example 2
You usually have small datasets and only want to speed up your application, which is computationally heavy. Also, you do not want to spend much time thinking about parallelization, but more your algorithms in general.
In this case OpenMP is your first choice. You only need to add a few statements here and there (e.g. in front of your for loops that you want to accelerate), and if your program is not too complex, OpenMP will do the rest for you automatically.
Example 3
You want it all. You need more memory, i.e. more computing nodes, but you also want to speed up your calculations as much as possible, i.e. running on more than one core per node.
Now your hardware comes into play. From my personal experience, if you have only a few cores per node (4-8), the performance penalty created by the general overhead of using OpenMP (i.e. starting up the OpenMP threads etc.) is more than the overhead of processor-internal MPI communication (i.e. sending MPI messages between processes that actually share memory and would not need MPI to communicate).
However, if you are working on a machine with more cores per node (16+), it will become necessary to use a hybrid approach, i.e. parallelizing with MPI and OpenMP at the same time. In this case, hybrid parallelization will be necessary to make full use of your computational resources, but it is also the most difficult to code and to maintain.
Summary
If you have a problem that is small enough to be run on just one node, use OpenMP. If you know that you need more than one node (and thus definitely need MPI), but you favor code readability/effort over performance, use only MPI. If using MPI only does not give you the speedup you would like/require, you have to do it all and go hybrid.
To your second question (in case that did not become clear):
If you setup is such that you do not need MPI at all (because your will always run on only one node), use OpenMP as it will be faster. But If you know that you need MPI anyways, I would start with that and only add OpenMP later, when you know that you've exhausted all reasonable optimization options for MPI.

With most distributed memory platforms nowadays consisting of SMP or NUMA nodes it just makes no sense to not use OpenMP. OpenMP and MPI can perfectly work together; OpenMP feeds the cores on each node and MPI communicates between the nodes. This is called hybrid programming. It was considered exotic 10 years ago but now it is becoming mainstream in High Performance Computing.
As for the question itself, the right answer, given the information provided, has always been one and the same: IT DEPENDS.

For use on a single shared memory machine like that, I'd recommend OpenMP. It make some aspects of the problem simpler and might be faster.
If you ever plan to move to a distributed memory machine, then use MPI. It'll save you solving the same problem twice.
The reason I say OpenMP might be faster is because a good implementation of MPI could be clever enough to spot that it's being used in a shared memory environment and optimise its behaviour accordingly.

Just for a bigger picture, hybrid programming has become popular because OpenMP benefits from cache topology, by using the same address space. As MPI might have the same data replicated over the memory (because process can't share data) it might suffer from cache cancelation.
On the other hand, if you partition your data correctly, and each processor has a private cache, it might come to a point were your problem fit completely in cache. In this case you have super linear speedups.
By talking in cache, there are very different cache topology on recent processors, and has always: IT DEPENDS...

Developing with OpenCl on ATI and Nvidia on the same time

our workgroup is slowly trying a little bit of OpenCl in a side project. So far 'everybody' is working on NVIDIA Quadro FX 580. Now we are planning to buy new computers for new colleages and instead of the FX 580 we could buy ATI FirePro V4800 instead, which costs only 15Eur more and give us 1Gig instead of 512Gig of Ram which will benificial for our data intensive tasks.
So, how much trouble is it to develop OpenCl code at the same time on Nvidia and ATI?
I read the following SO question, Running OpenCL on hardware from mixed vendors, which was very pessimistic about developing on/for different vendors. On the other side, the question is already a year old.
What do you reccomend?

I have previous worked extensively with CUDA programming language.
I have been planning to start developing apps using OpenCL. As you mentioned one of the best features with OpenCL is running on many vendor hardware (Intel, AMD and Nvidia).
One project that I came across that used openCL extensively for large scale development is http://sourceforge.net/projects/hypgad/. It might be a good idea to look at the source code from this group and understand how they have developed their application on so many hardware including sony cell processor.
Another approach would be to use PyOPENCL, which provides higher abstraction than OpenCL and can significantly reduce the coding effort.

Do you need the code to run unchanged on both bits of hardware? If so you may have to develop for a limited subset of common functions.
If you can run slightly different c ode on each you will probably get better performance - in CUDA/OpenCL you generally have to tune the algorithms for the amount of ram, number of GPU engines anyway so it shoudldn't be much more work to also tweak for NVidia/AMD

The biggest problem is workgroup sizes. Some ATI cards I have used crash at above 64, but then it may be the Apple OSX 10.6 drivers I am using.

Developing for both ATI and NVIDIA is actually not too difficult so long as you avoid using any part of either vendor's SDK. Stick to OpenCL as it is defined in the OpenCL spec. (www.khronos.org/opencl) and your code will stay syntax portable. Due to differences in the underlying architectures, performance portability may be an issue. Local & Global worksizes really have to be determined independently for each card to maximize performance. Another thing to pay attention to is the types being used. Vector types (float2, float4) are especially useful on ATI cards, as each processing element actually contains 4 execution units (one for each RGB color channel, plus aplha).

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex