I apologize in advance for the vagueness of this question.
Background:
I am attempting to write a morphological image processing function in OpenCL. I have a __local buffer which I use to store data for every pixel (each pixel is represented by a work-item, no loop unrolling yet). Also, since I am early in testing, I am only using a single work-group (8x8 pixel image so I can manually validate results).
Problem:
There are occasions when data from one, two, three, or even four pixels must be added into the pixel buffer of another. Since these are adjacent pixel in the same workgroup, I am sure I am causing local memory bank conflicts. That's ok, speed isn't my top priority (yet!). However, these bank conflicts seem to be dropping data and even corrupting data. I've been very careful not to overflow or over run the buffers.
So, my first question is: is it, in fact, possible that the the bank conflicts are causing data corruption and loss? The Opencl spec seems to indicate that the operation should serialize, slowing down the bandwidth - but there is no mention of data loss.
My second question is: Help! - What can I do about this?
Any guidance will be greatly appreciated - thanks!
maybe the nvidia whitepaper Prefix Sum (Scan) with CUDA can bring you on the right track. It is about the all-prefix-sums algorithm, which is a good example of a computation that seems inherently sequential, but for which there is an efficient parallel algorithm.
The all-prefix-sums operation turns lists of numbers [3,4,1,2] into their sums: [0,3,7,8].
I know the paper is about CUDA, but I found that the resulting kernels are very similar as
both tchnologies use similar concepts.
I hope, the paper can help you.
Cheers
Related
I am asking for an advise for the following problem:
For a research-project I am writing a brute-force algorithm based on a GPU with (py)OpenCl.
(I know JTR is out there)
Right now I do have a Brute-Force-Generator in Python which is filling up for each round the buffer with words (amount=1024*64).I pass the buffer to the GPU Kernel. The GPU is calculating for each value in the buffer a MD5 Hash Value and compares it with a given one. Great that it works!
BUT:
I don't think this is really the full performance i can get from the GPU - or is it? Isn't there a bottleneck when i have to fill up the buffer by the CPU and pass it to the GPU 'just' for a Hash calculation an comparison - or am i wrong and this is already the fastet or almost the fastet performance i can get?
I have done a lot of Research before I consider to ask this question here. I couldn't find a brute force implementation on the GPU kernel so far - why?
Thx
EDIT 1:
I try to explain it in a different way what I want to know. Lets say I have an average computer. Performing an brute-force-algorithm on a GPU is faster than on a CPU (if you do it right). I have looked through some GPU brute-force tools and couldn't find one with the whole brute-force implementation on the GPU Kernel.
Right now I am passing "word packages" to the GPU and let them do the work (hash & compare) there - looks like this is the common way . Isn't it faster to 'split' the brute-force algorithm so each Unit on the GPU will generate its own "word packages" by itself.
All I do is wondering why the common way is to pass packages with values from the CPU to the GPU instead of doing the CPU work also on the GPU work! Is it because it is not possible to split a brute-force algorithm on a GPU or isn't the improvement worth the effort to port it to the GPU?
About the performance of the "brute-force" approach.
All i do is wondering why the common way is to pass packages with values from the CPU to the GPU instead of doing the CPU work also on the GPU work! Is it because it is not possible to split a brute-force algorithm on a GPU or isn't the improvement worth the effort to port it to the GPU?
I do not know the details of your algorithm, but, in general, there are some points to consider before creating a hybrid CPU-GPU algorithm. Just to name a few:
Different architectures (best CPU algorithm probably is not the best
GPU algorithm).
Extra synchronization points.
Different memory spaces (implies PCIe/network transfers).
More complex algorithms
More complex fine tuning.
Vendors policy.
Nevertheless, there are quite a few examples out there that combines the power of the GPU and the CPU at the same time. Typically, sequential or highly divergent parts of the algorithm will run on the CPU while the homogeneous, computing intensive part runs on the GPU. Other applications, uses the CPU to preprocess the input data to a format that is more amenable to GPU processing (for instance, changing the data layout). Finally, there are applications targeting pure performance that really do a significant amount of work on the CPU, like the MAGMA project.
In summary, the answer it that it really depends on the details of your algorithm if it is really possible or if it worth it to design a hybrid algorithm that takes the most of your CPU-GPU system as a whole.
About the performance of your current approach
I think you should break down your question in two parts:
It is my GPU kernel efficient?
How much time am I actually doing work at the GPU?
Regarding the first one, you didn't provide any information about your GPU kernel so we could not really help you with it, but general optimization approaches apply:
Is it your computation memory/compute bound?
How far are you from your GPU peak memory bandwidth?
You need to start from these question in order to known what kind of optimization/algorithm you should apply. Take a look at the roofline performance model.
As for the second question, even though you don't go into detail, it seems like your application spend so much time on small memory transfers (take a look at this article about how to optimize memory transfers). The overhead of starting the PCIe just to send a few words would kill any performance benefit you get from using a GPU device. Thus, sending a bunch of small buffers instead of large chunks of memory packing a large number of them is not, in general, the way to go.
If you're looking for performance, you may want to overlap the computation and the memory transfer. Read this article for more information.
As a general recommendation, before implementing any optimization, take some time to profile your application. It would save you a lot of time.
So I'm having a hard time grasping the idea behind pointers and all that memory allocation.
I'm thinking nowadays with computer as powerful as they are right now why do we have to use pointers at all?
Isn't there always a workaround to do things without the help of pointers?
Pointers are an indirection: instead of working with the data itself, you are working with (something) that points to the data. Depending on the semantics of the language, this allows many things: cheaply switch to another instance of data (by setting the pointer to point to another instance), passing pointers allows access to the original data without having to make (a possibly expensive) copy, etc.
Memory allocation is related to pointers, but separate: you can have pointers without allocating memory. The reason you need pointers for memory allocation is that the actual address the allocated block of memory resides is not known at compile time, so you can only access it via a level of indirection (i.e. pointers) -- the compiler statically allocates space for the pointer that will point to the dynamically allocated memory.
Pointers are incredibly powerful. Just because computers have a faster processing time nowdays, doesn't mean that's any reason to abandon something as essential as pointers. Passing around giant chunks of memory on the stack is inefficient at best, catastrophic at worst. With pointers, you only need to maintain a reference to where the data resides, rather than duplicating huge chunks of memory each time you call a function.
Also, if you're copying all the data every time, how do you modify the original data? Aside from returning the copy of the structure in every call that touches it.
I remember reading somewhere that Dijkstra was assessing a student for a programming course; this student was quite intelligent but s/he wasn't able to solve the problem because there was sort of a mental block.
All the code was sort of ok, but what was needed was simply to use the expression
a[a[i+1]] = j;
and even if being so close to the solution still the goal seemed to be miles away.
Languages "without pointers" already exist... e.g. BASIC. Without explicit pointers, that is. But the indirection idea... the idea that you can have data to mean just where to find other data is central to programming.
The very idea of array is about being able to use computed values to find other values.
Trying to hide this idea is an horrible plan. According to Dijkstra anyone that has been exposed to the BASIC language has already received such a mental mutilation that is impossible to recover as a good programmer (and probably the absence of explicit indirection was one of the problems).
I think he was exaggerating.
Just a bit.
I'm capturing video frames. Every frame is passed into the kernel as an Image2D. I've got about five simple image processing algorithms (blur, sharpen etc.) which a user can choose of (also a combination of different ones is possible). I see three possibilities here:
One kernel: At runtime construct the string of the kernel with the
chosen algorithms and compile (and take the overhead of the one-time
compiling delay)
One kernel: Handle the chosen algorithms with flags (although I understand that conditional branches are undesirable)
Many kernels (one per algorithm): Seems to be the issue, that an Image2D can either be read_only or write_only and I would need to repetively copy the image from and to the GPU as one output image of a kernel is the input image of the next kernel.
Is there a suggested rule of thumb which way to follow?
One workaround for the readonly/writeonly problem could be to use buffers for the middle steps.
Image2D -> buffer0 -> buffer1 -> ... bufferN -> Image2D
Or use two buffers and alternate with them if you don't need the intermediate results. (I2d, B0, B1, B0, ..., I2D)
You would probably need to know in advance how many filters you are applying, but that shouldn't be much of an issue.
I suggest that you try and avoid the first two options
This way you will have a very unreadable and complicated kernel code. which is OK if you are a 100% sure that the code does what it is supposed to do. That is with considerations on the arrangement and grouping of your work-items. The point is that it will be difficult and to debug and maintain your kernels this way.
This, I think, is even worse than the first option if your threads can go down in different paths. Simple branching can reduce your performances and create troubles for synchronizations. IN ADDITION to simple branching you should consider the fact that your algorithms might require different number and arrangement of threads. If that is the case then using one kernel for all operations is a really bad idea.
I haven't tried it myself but I think you should try out the option suggested by #mfa.
I am trying to implement a GEMM implmentation using AMD-APP-SDK 2.4 on a ATI HD 6990 card (Cayman architecture).
One of the optimizing techniques is the use of blocking/tiling.
In its implementation, is it faster if we store the sub-matrices in the shared local memory or is it faster when we use a texture cache? If possible please give the reason also.
Please also suggest which is easier to implement.
Thanks.
P.S. I want it for single precision only, if it matters!
Note: The size of the sub matrix is not an issue, however I feel that since the larger it is the better it would be. The only factor to be taken in consideration is that if unit of memory is 128 bit (4 single precision) then, block size should be a multiple of 4.
The Cypress chips were used in the 5800 series Radeons. The 6900 series uses the Cayman core, which has several important differences, most notably that it is a VLIW4 architecture instead of the VLIW5 configuration used in earlier cores.
As always, the only definitive way to know which method is faster is to benchmark it. In particular, since you give no information about the size of the sub-matrices, it is hard to say where they will best fit.
Imagine you have a channel of communication that is inherently lossy and one-way. That is, there is some inherent noise that is impossible to remove that causes, say, random bits to be toggled. Also imagine that it is one way - you cannot request retransmission.
But you need to send data over it regardless. What techniques can you use to send numbers and text over that channel?
Is it possible to encode numbers so that even with random bit twiddling they can still be interpreted as values close to the original (lossy transmittion)?
Is there a way to send a string of characters (ASCII, say) in a lossless fashion?
This is just for fun. I know you can use morse code or any very low frequency binary communication. I know about parity bits and checksums to detect errors and retrying. I know that you might as well use an analog signal. I'm just curious if there are any interesting computer-sciency techniques to send this stuff over a lossy channel.
Depending on some details that you don't supply about your lossy channel, I would recommend, first using a Gray code to ensure that single-bit errors result in small differences (to cover your desire for loss mitigation in lossy transmission), and then possibly also encoding the resulting stream with some "lossless" (==tries to be loss-less;-) encoding.
Reed-Solomon and variants thereof are particularly good if your noise episodes are prone to occur in small bursts (several bit mistakes within, say, a single byte), which should interoperate well with Gray coding (since multi-bit mistakes are the killers for the "loss mitigation" aspect of Gray, designed to degrade gracefully for single-bit errors on the wire). That's because R-S is intrinsically a block scheme, and multiple errors within one block are basically the same as a single error in it, from R-S's point of view;-).
R-S is particularly awesome if many of the errors are erasures -- to put it simply, an erasure is a symbol that has most probably been mangled in transmission, BUT for which you DO know the crucial fact that it HAS been mangled. The physical layer, depending on how it's designed, can often have hints about that fact, and if there's a way for it to inform the higher layers, that can be of crucial help. Let me explain erasures a bit...:
Say for a simplified example that a 0 is sent as a level of -1 volt and a 1 is send as a level of +1 volt (wrt some reference wave), but there's noise (physical noise can often be well-modeled, ask any competent communication engineer;-); depending on the noise model the decoding might be that anything -0.7 V and down is considered a 0 bit, anything +0.7 V and up is considered a 1 bit, anything in-between is considered an erasure, i.e., the higher layer is told that the bit in question was probably mangled in transmission and should therefore be disregarded. (I sometimes give this as one example of my thesis that sometimes abstractions SHOULD "leak" -- in a controlled and architected way: the Martelli corollary to Spolsky's Law of Leaky Abstractions!-).
A R-S code with any given redundancy ratio can be about twice as effective at correcting erasures (errors the decoder is told about) as it can be at correcting otherwise-unknown errors -- it's also possible to mix both aspects, correcting both some erasures AND some otherwise-unknown errors.
As the cherry on top, custom R-S codes can be (reasonably easily) designed and tailored to reduce the probability of uncorrected errors to below any required threshold θ given a precise model of the physical channel's characteristics in terms of both erasures and undetected errors (including both probability and burstiness).
I wouldn't call this whole area a "computer-sciency" one, actually: back when I graduated (MSEE, 30 years ago), I was mostly trying to avoid "CS" stuff in favor of chip design, system design, advanced radio systems, &c -- yet I was taught this stuff (well, the subset that was already within the realm of practical engineering use;-) pretty well.
And, just to confirm that things haven't changed all that much in one generation: my daughter just got her MS in telecom engineering (strictly focusing on advanced radio systems) -- she can't design just about any serious program, algorithm, or data structure (though she did just fine in the mandatory courses on C and Java, there was absolutely no CS depth in those courses, nor elsewhere in her curriculum -- her daily working language is matlab...!-) -- yet she knows more about information and coding theory than I ever learned, and that's before any PhD level study (she's staying for her PhD, but that hasn't yet begun).
So, I claim these fields are more EE-y than CS-y (though of course the boundaries are ever fuzzy -- witness the fact that after a few years designing chips I ended up as a SW guy more or less by accident, and so did a lot of my contemporaries;-).
This question is the subject of coding theory.
Probably one of the better-known methods is to use Hamming code. It might not be the best way of correcting errors on large scales, but it's incredibly simple to understand.
There is the redundant encoding used in optical media that can recover bit-loss.
ECC is also used in hard-disks and RAM
The TCP protocol can handle quite a lot of data loss with retransmissions.
Either Turbo Codes or Low-density parity-checking codes for general data, because these come closest to approaching the Shannon limit - see wikipedia.
You can use Reed-Solomon codes.
See also the Sliding Window Protocol (which is used by TCP).
Although this includes dealing with packets being re-ordered or lost altogether, which was not part of your problem definition.
As Alex Martelli says, there's lots of coding theory in the world, but Reed-Solomon codes are definitely a sweet spot. If you actually want to build something, Jim Plank has written a nice tutorial on Reed-Solomon coding. Plank has a professional interest in coding with a lot of practical expertise to back it up.
I would go for some of these suggestions, followed by multiple sendings of the same data. So that way you can hope for different errors to be introduced at different points in the stream, and you may be able to infer the desired number a lot easier.