I'm interested in making a faster Myers diff implementation by running it on a GPU, i.e. with OpenCL. I have a good understanding of the algorithm but am new to GPU programming. My hunch is that the GPU will not perform well, but I'd like to hear thoughts and ideas.
Here's a description of one iteration of the algorithm in C. We have two constant buffers of bytes 'left' and 'right' (the data we're comparing), and a shared mutable array of int32 called vector. 'idx' is the iteration index. Then the algorithm is essentially this:
void myers_diff_iteration(const uint8 *left, const uint8 *right, int32 *vector, int32 idx) {
int32 x = MAX(vector[idx-1], vector[idx+1]);
int32 y = idx - x;
while (left[x] == right[y]) {
x++;
y++;
}
vector[x] = x;
}
My guess is that the while loop (which has a very unpredictable number of iterations, ranging from zero to a million) is likely to be very bad on the GPU, and eliminate any performance gain. Is that true? Any hints for how to improve it?
Also, the vector is shared between all iterations of the loop. Each iteration writes to a different location, so there's no synchronization needed (beyond requiring that writes to an aligned 4-byte word don't affect neighboring words). Is such a shared vector going to perform well?
Thanks for any help!
You could try. The GPU will have serious problems with the while loop, but as long as there is enough "iterations" (threads) running, shouldn't be any speed lose.
You could rewrite it this way:
void myers_diff_iteration(const uint8 *left, const uint8 *right, int32 *vector, int32 idx) {
int id = get_global_id(0);
int32 x = MAX(vector[idx-1], vector[idx+1]) + id;
int32 y = idx - x + id;
if (left[x] != right[y]) {
vector[x] = x;
}
}
You only then need to run the max while loop numer of threads. But it will only produce 1 result of the vector each OpenCL run.
The best it to try, and then do some variations.
Related
I'm running experiments aiming to understand the behavior of random read and write access to global memory.
The following kernel reads from an input vector (groupColumn) with a coalesced access pattern and reads random entries from a hash table in global memory.
struct Entry {
uint group;
uint payload;
};
typedef struct Entry Entry;
__kernel void global_random_write_access(__global const uint* restrict groupColumn,
__global Entry* globalHashTable,
__const uint HASH_TABLE_SIZE,
__const uint HASH_TABLE_SIZE_BITS,
__const uint BATCH,
__const uint STRIDE) {
int global_id = get_global_id(0);
int local_id = get_local_id(0);
uint end = BATCH * STRIDE;
uint sum = 0;
for (int i = 0; i < end; i += STRIDE) {
uint idx = global_id + i;
// hash keys are pre-computed
uint hash_key = groupColumn[idx]; // coalesced read access
__global Entry* entry = &globalHashTable[hash_key]; // pointer arithmetic
sum += entry->payload; // random read
}
if (local_id < HASH_TABLE_SIZE) {
globalHashTable[local_id].payload = sum; // rare coalesced write
}
}
I ran this kernel on a NVIDIA V100 card with multiple iterations. The variance of the results is very low, thus, I only plot one dot per group configuration. The input data size is 1 GiB and each thread processes 128 entries (BATCH = 128). Here are the results:
So far so good. The V100 has a max memory bandwidth of roughly 840GiB/sec and the measurements are close enough, given the fact that there are random memory reads involved.
Now I'm testing random writes to global memory with the following kernel:
__kernel void global_random_write_access(__global const uint* restrict groupColumn,
__global Entry* globalHashTable,
__const uint HASH_TABLE_SIZE,
__const uint HASH_TABLE_SIZE_BITS,
__const uint BATCH,
__const uint STRIDE) {
int global_id = get_global_id(0);
int local_id = get_local_id(0);
uint end = BATCH * STRIDE;
uint sum = 0;
for (int i = 0; i < end; i += STRIDE) {
uint idx = global_id + i;
// hash keys are pre-computed
uint hash_key = groupColumn[idx]; // coalesced read access
__global Entry* entry = &globalHashTable[hash_key]; // pointer arithmetic
sum += i;
entry->payload = sum; // random write
}
if (local_id < HASH_TABLE_SIZE) {
globalHashTable[local_id].payload = sum; // rare coalesced write
}
}
Godbolt: OpenCL -> PTX
The performance drops significantly to a few GiB/sec for few groups.
I can't make any sense of the behavior. As soon as the hash table reaches the size of L1 the performance seems to be limited by L2. For fewer groups the performance is way lower. I don't really understand what the limiting factors are.
The CUDA documentation doesn't say much about how store instructions are handled internally. The only thing I could find is that the st.wb PTX instruction (Cache Operations) might cause a hit on stale L1 cache if another thread would try to read the same addess via ld.ca. However, there are no reads to the hash table involved here.
Any hints or links to understanding the performance behavior are much appreciated.
Edit:
I actually found a bug in my code that didn't pre-compute the hash keys. The access to global memory wasn't random, but actually coalesced due to how I generated the values. I further simplified my experiments by removing the hash table. Now I only have one integer input column and one interger output column. Again, I want to see how the writes to global memory actually behave for different memory ranges. Ultimately, I want to understand which hardware properties influence the performance of writes to global memory and see if I can predict based on the code what performance to expect.
I tested this with two kernels that do the following:
Read from input, write to output
Read from input, read from output and write to output
I also applied two different access patterns, by generating the values in the group column:
SEQUENTIAL: sequentially increasing numbers until current group's size is reached. This pattern leads to a coalesced memory access when reading and writing from the output column.
RANDOM: uni-distributed random numbers within the current group's size. This pattern leads to a misaligned memory access when reading and writing from the output column.
(1) Read & Write
__kernel void global_write_access(__global const uint* restrict groupColumn,
__global uint *restrict output,
__const uint BATCH,
__const uint STRIDE) {
int global_id = get_global_id(0);
int local_id = get_local_id(0);
uint end = BATCH * STRIDE;
uint sum = 0;
for (int i = 0; i < end; i += STRIDE) {
uint idx = global_id + i;
uint group = groupColumn[idx]; // coalesced read access
sum += i;
output[group] = sum; // write (coalesced | random)
}
}
PTX Code: https://godbolt.org/z/19nTdK
(2) Read, Read & Write
__kernel void global_read_write_access(__global const uint* restrict groupColumn,
__global uint *restrict output,
__const uint BATCH,
__const uint STRIDE) {
int global_id = get_global_id(0);
int local_id = get_local_id(0);
uint end = BATCH * STRIDE;
for (int i = 0; i < end; i += STRIDE) {
uint idx = global_id + i;
uint group = groupColumn[idx]; // coalesced read access
output[group] += 1; // read & write (coalesced | random)
}
}
PTX Code: https://godbolt.org/z/b647cz
As ProjectPhysX pointed out, the access pattern makes a huge difference. However, for small groups the performance is quite similar for both access patterns. In general, I would like to better understand the shape of the curves and which hardware properties, architectural features etc. influence this shape.
From the cuda programming guide I learned that global memory accesses are conducted via 32-, 64-, or 128-byte transactions. Accesses to L2 are done via 32-byte transactions. So up to 8 integer words can be accessed via a single transaction. This might explain the plateau with a bump at 8 groups at the beginning of the curve. After that more transactions are needed and performance drops.
One cache line is 128 bytes long (both on L1 and L2), hence, 32 intergers fit into a single cache line. For more groups more cache lines are required which can be potentially processed in parallel by more memory controllers. That might be the reason for the performance to increase here. 8 controllers are available on the V100 So I would expect the performance to peak at 256 groups. Though, it doesn't. Instead it will steadily increase performance until reaching 4096 groups and plateau there with roughly 750 GiB/sec.
The plateauing in your second performane plot is GPU saturation: For only a few work groups, the GPU is partly idle and the latencies involved in launching the kernel significantly reduce performance. Above 8192 groups, the GPU fully saturates its memory bandwidth. The plateau only is at ~520GB/s because of the misaligned writes (have low performance on the V100) and also the "rare coalesced write" in the if-block, which happens at least once per group. For branching within the group, all other threads have to wait for the single write operation to finish. Also this write is not coalesced, because it is not happening for every thread in the group. On the V100, misaligned write performance is very poor at max. ~120GB/s, see the benchmark here.
Note that if you would comment the if-part, the compiler sees that you do not do anything with sum and optimizes everything out, leaving you with a blank kernel in PTX.
The first performance graph to me is a bit more confusing. The only difference in the first kernel to the second is that the random wrtite in the loop is replaced by a random read. Generally, read performance on the V100 is much better (~840GB/s, regardless of coalesced/misaligned) than misaligned write performance, so performance is expected to be much better overall and indeed it is. However I can't make sense of the performance dropping for more groups, where saturation should theoretically be better. But the performance drop isn't really that significant at ~760GB/s vs. 730GB/s.
To summarize, you are observing that the performance penalty for misaligned writes (~120GB/s vs. ~900GB/s for coalesced writes) is much larger than for reads, where performance is about the same for coalesced/misaligned at ~840GB/s. This is common thing for GPUs, with some variance of course between microarchitectures. Typically there is at least some performance penalty for misaligned reads, but not as large as for misaligned writes.
I want to know the priority of index counting for the following code snippets (simple 2 dimensional matrix multiplication routine).
kernel void mmul(
const int N,
global float* A,
global float* B,
global float* C)
{
int k;
int i = get_global_id(0);
int j = get_global_id(1);
float tmp;
if ((i < N) && (j < N))
{
tmp = 0.0f;
for (k = 0; k < N; k++)
tmp += A[i*N+k] * B[k*N+j];
C[i*N+j] = tmp;
}
}
If you look inside the for loop with 'k' counter you can see global work-item 'i' and 'j' placed in the same line. I want to know which of them have priority in terms of counting the indexes (eg. 1,2,3,4, ... , n) of 'i' and 'j'. I don't understand how this would work as I am new to OpenCl and I would use nested for loop, if I am just using normal C or Python, for this type of operation.
Can someone explain how the global work-item work?
Thank you.
You should focus more on memory read/write priorities than workitem issuing order. To enforce a priority/order on memory operations, use mem_fence(in-workitem) , barrier(in-workgroup) and even kernels(all workitems sync point). Using deliberate empty for-loops or atomic functions cannot guarantee a memory-write/read priority. Only memory fences/barriers/kernels can.
There is no priority for any workitem(to start/end running) but they are grouped and executed on compute units which have many threads to run them. There is no guarantee that workitem i,j will execute before i+1,j+1 but there is a guarantee they will be executed in same compute unit(with cores sharing L1 cache) if they are in same workgroup(with size of 16,16 for example) when using Nvidia and Amd gpus.
Being executed in same compute unit increases chances of being issued at the same time which is not a priority ofcourse but sharing resources like L1 cache means high performance.
Even in same workgroup, there is no guarantee if a local workitem is issued before some other workitem but they are more likely happening at the same time if they are on same SIMD unit(such as 16-wide parts in Amd gpu).
This is my first post. I'll try to keep it short because I value your time. This community has been incredible to me.
I am learning OpenCL and want to extract a little bit of parallelism from the below algorithm. I will only show you the part that I am working on, which I've also simplified as much as I can.
1) Inputs: Two 1D arrays of length (n): A, B, and value of n. Also values C[0], D[0].
2) Outputs: Two 1D arrays of length (n): C, D.
C[i] = function1(C[i-1])
D[i] = function2(C[i-1],D[i-1])
So these are recursive definitions, however the calculation of C & D for a given i value can be done in parallel (they are obviously more complicated, so as to make sense). A naive thought would be creating two work items for the following kernel:
__kernel void test (__global float* A, __global float* B, __global float* C,
__global float* D, int n, float C0, float D0) {
int i, j=get_global_id(0);
if (j==0) {
C[0] = C0;
for (i=1;i<=n-1;i++) {
C[i] = function1(C[i-1]);
[WAIT FOR W.I. 1 TO FINISH CALCULATING D[i]];
}
return;
}
else {
D[0] = D0;
for (i=1;i<=n-1;i++) {
D[i] = function2(C[i-1],D[i-1]);
[WAIT FOR W.I. 0 TO FINISH CALCULATING C[i]];
}
return;
}
}
Ideally each of the two work items (numbers 0,1) would do one initial comparison and then enter their respective loop, synchronizing for each iteration. Now given the SIMD implementation of GPUs, I assume that this will NOT work (work items would be waiting for all of the kernel code), however is it possible to assign this type of work to two CPU cores and have it work as expected? What will the barrier be in this case?
This can be implemented in opencl, but like the other answer says, you're going to be limited to 2 threads at best.
My version of your function should be called with a single work group having two work items.
__kernel void test (__global float* A, __global float* B, __global float* C, __global float* D, int n, float C0, float D0)
{
int i;
int gid = get_global_id(0);
local float prevC;
local float prevD;
if (gid == 0) {
C[0] = prevC = C0;
D[0] = prevD = D0;
}
barrier(CLK_LOCAL_MEM_FENCE);
for (i=1;i<=n-1;i++) {
if(gid == 0){
C[i] = function1(prevC);
}else if (gid == 1){
D[i] = function2(prevC, prevD);
}
barrier(CLK_LOCAL_MEM_FENCE);
prevC = C[i];
prevD = D[i];
}
}
This should run on any opencl hardware. If you don't care about saving all of the C and D values, you can simply return prevC and prevD in two floats rather than the entire list. This would also make it much faster due to sticking to a lower cache level (ie local memory) for all reading and writing of the intermediate values. The local memory boost should also apply to all opencl hardware.
So is there a point to running this on a GPU? Not for the parallelism. You are stuck with 2 threads. But if you don't need all values of C and D returned, you would probably see a significant speed up because of the much faster memory of GPUs.
All of this assumes that function1 and function2 aren't overly complex. If they are, just stick to CPUs -- and probably another multiprocessing technique such as OpenMP.
Dependency in your case is completely linear/recursive (i needs i-1). Not even logaritmic like other problems (reduction, sum, sort, etc.). And therefore this problem does not fit well in a SIMD device.
The best you can do is go a 2 threads approach in CPU. Thread 1 will "produce" data (C value), for thread 2.
A very naive approach for example:
Thread 1:
for(){
ProcessC(i);
atomic_inc(counter); //This function should unlock
}
Thread 2:
for(){
atomic_dec(counter); //This function should lock
ProcessD(i);
}
Where atomic_inc and atomic_dec can be implemented with counting semaphores for example.
I'm using PyOpenCL to let my GPU do some regression on a large data set. Right now the GPU is slower than the CPU, probably because there is a loop that requires access to the global memory during each increment (I think...). The data set is too large to store into the local memory, but each loop does not require the entire data set, so I want to copy a portion of this array to the local memory. My question is: how do I do this? In Python one can easily slice a portion, but I don't think that's possible in OpenCL.
Here's the OpenCL code I'm using, if you spot any more potential optimisations, please shout:
__kernel void gpu_slope(__global double * data, __global double * time, __global int * win_results, const unsigned int N, const unsigned int Nmax, const double e, __global double * result) {
__local unsigned int n, length, leftlim, rightlim, i;
__local double sumx, sumy, x, y, xx, xy, invlen, a, b;
n = get_global_id(0);
leftlim = win_results[n*2];
rightlim = win_results[n*2+1];
sumx = 0;
sumy = 0;
xy = 0;
xx = 0;
length = rightlim - leftlim;
for(i = leftlim; i <= rightlim; i++) {
x = time[i]; /* I think this is fetched from global memory */
y = data[i];
sumx += x;
sumy += y;
xy += x*y;
xx += x*x;
}
invlen = 1.0/length;
a = xy-(sumx*sumy)*invlen;
b = xx-(sumx*sumx)*invlen;
result[n] = a/b;
}
I'm new to OpenCL, so please bear with me. Thanks!
The main(ish) point in GPU computing is trying to utilize hardware parallelism as much as possible. Instead of using the loop, launch a kernel with a different thread for every one of the coordinates. Then, either use atomic operations (the quick-to-code, but slow-performance option), or parallel reduction, for the various sums.
AMD has A tutorial on this subject. (NVidia does too, but theirs would be CUDA-based...)
You will find examples copying to local memory in PyOpenCL's examples folder: https://github.com/inducer/pyopencl/tree/master/examples
I recommend you read, run, and customize several of these examples to learn.
I also recommend the Udacity parallel programming course: https://www.udacity.com/course/cs344 This course will help solidify your grasp of fundamental OpenCL concepts.
I have the following opencl kernel function to get the column sum of a image.
__kernel void columnSum(__global float* src,__global float* dst,int srcCols,
int srcRows,int srcStep,int dstStep)
{
const int x = get_global_id(0);
srcStep >>= 2;
dstStep >>= 2;
if (x < srcCols)
{
int srcIdx = x ;
int dstIdx = x ;
float sum = 0;
for (int y = 0; y < srcRows; ++y)
{
sum += src[srcIdx];
dst[dstIdx] = sum;
srcIdx += srcStep;
dstIdx += dstStep;
}
}
}
I assign that each thread process a column here so that a lot of threads can get the column_sum of each column in parallel.
I also use float4 to rewrite the above kernel so that each thread can read 4 elements in a row at one time from the source image, which is shown below.
__kernel void columnSum(__global float* src,__global float* dst,int srcCols,
int srcRows,int srcStep,int dstStep)
{
const int x = get_global_id(0);
srcStep >>= 2;
dstStep >>= 2;
if (x < srcCols/4)
{
int srcIdx = x ;
int dstIdx = x ;
float4 sum = (float4)(0.0f, 0.0f, 0.0f, 0.0f);
for (int y = 0; y < srcRows; ++y)
{
float4 temp2;
temp2 = vload4(0, &src[4 * srcIdx]);
sum = sum + temp2;
vstore4(sum, 0, &dst[4 * dstIdx]);
srcIdx += (srcStep/4);
dstIdx += (dstStep/4);
}
}
}
In this case, theoretically, I think the time consumed by the second kernel to process a image should be 1/4 of the time consumed by the first kernel function. However, no matter how large the image is, the two kernels almost consume the same time. I don't know why. Can you guys give me some ideas? T
OpenCL vector data types like float4 were fitting better the older GPU architectures, especially AMD's GPUs. Modern GPUs don't have SIMD registers available for individual work-items, they are scalar in that respect. CL_DEVICE_PREFERRED_VECTOR_WIDTH_* equals 1 for OpenCL driver on NVIDIA Kepler GPU and Intel HD integrated graphics. So adding float4 vectors on modern GPU should require 4 operations. On the other hand, OpenCL driver on Intel Core CPU has CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT equal to 4, so these vectors could be added in a single step.
You are directly reading the values from "src" array (global memory). Which typically is 400 times slower than private memory. Your bottleneck is definitelly the memory access, not the "add" operation itself.
When you move from float to float4, the vector operation (add/multiply/...) is more efficient thanks to the ability of the GPU to operate with vectors. However, the read/write to global memory remains the same.
And since that is the main bottleneck, you will not see any speedup at all.
If you want to speed your algorithm, you should move to local memory. However you have to manually resolve the memory management, and the proper block size.
which architecture do you use?
Using float4 has higher instruction level parallelism (and then require 4 times less threads) so theoretically should be faster (see http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf)
However did i understand correctly in you kernel you are doing prefix-sum (you store the partial sum at every iteration of y)? If so, because of the stores the bottleneck is at the memory writes.
I think on the GPU float4 is not a SIMD operation in OpenCL. In other words if you add two float4 values the sum is done in four steps rather than all at once. Floatn is really designed for the CPU. On the GPU floatn serves only as a convenient syntax, at least on Nvidia cards. Each thread on the GPU acts as if it is scalar processor without SIMD. But the threads in a warp are not independent like they are on the CPU. The right way to think of the GPGPU models is Single Instruction Multiple Threads (SIMT).
http://www.yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html
Have you tried running your code on the CPU? I think the code with float4 should run quicker (potentially four times quicker) than the scalar code on the CPU. Also if you have a CPU with AVX then you should try float8. If the float4 code is faster on the CPU than float8 should be even faster on a CPU with AVX.
try to define __ attribute __ to kernel and see changes in run timing
for example try to define:
__ kernel void __ attribute__((vec_type_hint(int)))
or
__ kernel void __ attribute__((vec_type_hint(int4)))
or some floatN as you want
read more:
https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/functionQualifiers.html