I'm playing around with an example on opencl:
__kernel void atomic(__global int* x) {
__local int a, b;
a = 0; b = 0;
a++;
atomic_inc(&b);
x[0] = a;
x[1] = b;
x[2]++;
atomic_inc(x+3);
}
Running this code with global_size = 1024 and workgroup_size = 8, this is the following output:
[1 8 1 1024]
I can understand what is happening for all cases except the value given for x[1]. Why is the value of x[1] not 1024 but 8?
Under x[1] is stored value of b which is a variable residing in __local address space meaning the variable is shared by all work items within a workgroup. Each of workgroup have b initialized to 0 and atomically incremented to 8 because workgroup size is 8 (each work item increments by 1).
Related
Given an undirected graph with positive edge costs, choose a subset of edges such that there are no cycles and the sum of the cost is maximal.
The input consists of several graphs, each defined with number of vertices n, number of edges m, and m triples x,y,c to indicate an edge between x and y of cost c. The vertices are numbered from 0 to n - 1. It is assumed that 1 ≤ n ≤ 104, 0 ≤ m ≤ 5 n, and 1 ≤ c ≤ 105. there may be more than one edge between two vertices, and even edges with x = y.
#include <iostream>
#include <vector>
using namespace std;
using P = pair<int,int>;
using VE = vector<int>;
using VP = vector<P>;
using VVE = vector<VP>;
int n,m;
VVE G;
VE cost;
VE vist;
VE pare;
int maxim(int x){
if(cost[x] != -1) return cost[x];
cost[x] = 0;
for(P y: G[x]){
if(cost[x] <= y.second + maxim(y.first)){
cost[x] = y.second + maxim(y.first);
}
}
return cost[x];
}
int main() {
while(cin >> n >> m){
G = VVE(n);
cost = VE(n,-1);
pare = VE(n,-1);
for(int i = 0; i < m; ++i){
int x,y,c; cin >> x >> y >> c;
G[x].push_back(P(y,c));
G[y].push_back(P(x,c));
}
int mx = -1;
for(int i = 0; i < n; ++i){
if(mx <= maxim(i)){
mx = maxim(i);
}
}
cout << mx << endl;
}
}
This is my code and I don't know how to solve the problem. I would appreciate help. As you can see the graph is read as a vector of vectors. In which each pair indicates that node x goes to node y with cost c.
As a commenter pointed out, this is the maximum spanning tree problem (which is the same as the minimum spanning tree problem, just negate the costs). That problem can be solved with the greedy algorithm. Initially place every node into a heap on its own. Then in a loop consider the edges in decreasing cost order. If the two endpoints of the considered edge are in the same heap, discard the edge. Otherwise select it and merge the heaps. When you have only one heap left, you can stop, and the selected edges form your solution.
I am trying to compute local sum of each group by identify with 3d volume position and group ID.
My idea is divide space into groups and use atomic_add to compute local_sum.
But because I am new to parallel computing so it is kind of hard to find the correlation between codes and instructions.
My current kernel is like:
__kernel void TestAtomicAddLocal(__global *int src, int3 size, __global int *res)
{
int x = get_global_id(0);
int y = get_global_id(1);
int z = get_global_id(2);
if( x >= vol_dim.x || y >= vol_dim.y || z >= vol_dim.z ){ return; }
int id = x + y * vol_dim.x + z * vol_dim.x * vol_dim.y;
// local mem shared by all work items in work group,
//so this can be accessed by all items in current workgroup
__local int local_sum;
local_sum= 0;
// use global_id to access the value of input array
int local_offset = atomic_add(&local_sum, src[id]);
barrier(CLK_LOCAL_MEM_FENCE);
int global_offset = atomic_add(&num_verts[0], local_sum);
barrier(CLK_GLOBAL_MEM_FENCE);
}
For the host part, my setting is
enqueueNDrangeKernel( cq, kn_testAtomicAddLocal, 3, 0, cl::size3(256,256,256), cl::size3(64, 64, 64), 0, 0, 0);
For kenrnel arguments, the *src is cl_mem with size 256*256*256*sizeof(cl_int), size is 4 * sizeof(cl_int), and *res is cl_mem with size 4*sizeof(int).
Then I get error that CL_OUT_OF_RESOURCE and CL_INVALID_GROUP_SIZE, from my understanding, my device max group size is 1024, but here total group = (256/64)^3 = 64 < 1024.
My gpu max work item size is 1024x1024x64 which is also ok. So I think there must something I understand is wrong. I hope someone could help me out.
The max group size limits your 64 * 64 * 64 part.
And I guess you are using a CUDA card. You'd better use CUDA on CUDA cards. OpenCL is more or less emulated on CUDA cards. If you are not, I thinks all AMD cards have a group size limit of 256. edit: Um... I forgot about Intel ones. If so, ignore this part.
One more important thing, you'd better check some reduction logic implementation example on the internet first. Atomics are very expesive and using them like what you did would almost certain make your GPU code slower than CPU ones.
I'm calling the kernel below with GlobalWorkSize 64 4 1 and WorkGroupSize 1 4 1 with the argument output initialized to zeros.
__kernel void kernelB(__global unsigned int * output)
{
uint gid0 = get_global_id(0);
uint gid1 = get_global_id(1);
output[gid0] += gid1;
}
I'm expecting 6 6 6 6 ... as the sum of the gid1's (0 + 1 + 2 + 3). Instead I get 3 3 3 3 ... Is there a way to get this functionality? In general I need the sum of the results of each work-item in a work group.
EDIT: It seems it must be said, I'd like to solve this problem without atomics.
You need to use local memory to store the output from all work items. After the work items are done their computation, you sum the results with an accumulation step.
__kernel void kernelB(__global unsigned int * output)
{
uint item_id = get_local_id(0);
uint group_id = get_group_id(0);
//memory size is hard-coded to the expected work group size for this example
local unsigned int result[4];
//the computation
result[item_id] = item_id % 3;
//wait for all items to write to result
barrier(CLK_LOCAL_MEM_FENCE);
//simple O(n) reduction using the first work item in the group
if(local_id == 0){
for(int i=1;i<4;i++){
result[0] += result[i];
}
output[group_id] = result[0];
}
}
Multiple work items are accessing elements of global simultaneously and the result is undefined. You need to use atomic operations or write unique location per work item.
i am new in OpenCL and i am trying to compute histogram of grayscaled image. I am performing this computation on GPU nvidia GT 330M.
code is
__kernel void histogram(__global struct gray * input, __global int * global_hist, __local volatile int * histogram){
int local_offset = get_local_id(0) * 256;
int histogram_global_offset = get_global_id(0) * 256;
int offset = get_global_id(0) * 1920;
int value;
for(unsigned int i = 0; i < 256; i++){
histogram[local_offset + i] = 0;
}
barrier(CLK_LOCAL_MEM_FENCE);
for(unsigned int i = 0; i < 1920; i++){
value = input[offset + i].i;
histogram[local_offset + value]++;
}
barrier(CLK_LOCAL_MEM_FENCE);
for(unsigned int i = 0; i < 256; i++){
global_hist[histogram_global_offset + i] = histogram[local_offset + i];
}
}
This computation is performed on image 1920*1080.
I am firing kernels with
queue.enqueueNDRangeKernel(kernel_histogram, cl::NullRange, cl::NDRange(1080), cl::NDRange(1));
When local size of histogram is set to 256 * sizeof(cl_int) speed of this computation is (through nvidia nsight performance analysis) 11 675 microseconds.
Because local workgroup size is set to one. I tried increase local workgroup size to 8. But when i increase local size of histogram to 256 * 8 * sizeof(cl_int) and compute with local wg size 1. I get 85 177 microseconds.
So when i fire it with 8 kernels per workgroup i dont get speedup from 11ms but from 85ms. So final speed with 8 kernels per worgroup is 13 714 microseconds.
But when i create computation bug, set local_offset to zero and size of local histogram is 256 * sizeof(cl_int) and use 8 kernels per workgroup i get much better time - 3 854 microsec.
Does anybody have some ideas to speed up this computation ?
Thanks!
This answer assumes you want to eventually reduce your histogram all the way down to 256 int values. You call the kernel with as many work groups as you have compute units on your device, and group size should be (as always) a multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE on the device.
__kernel void histogram(__global struct gray * input, __global int * global_hist){
int group_id = get_group_id(0);
int num_groups = get_num_groups(0);
int local_id = get_local_id(0);
int local_size = get_local_size(0);
volatile __local int histogram[256];
int i;
for(i=local_id; i<256; i+=local_size){
histogram[i] = 0;
}
int rowNum, colNum, value, global_hist_offset
for(rowNum = group_id; rowNum < 1080; rowNum+=num_groups){
for(colNum = local_id; colNum < 1920; colNum += local_size){
value = input[rowNum*1920 + colNum].i;
atomic_inc(histogram[input]);
}
}
barrier(CLK_LOCAL_MEM_FENCE);
global_hist_offset = group_id * 256;
for(i=local_id; i<256; i+=local_size){
global_hist[global_hist_offset + i] = histogram[i];
}
}
Each work group works cooperatively on one row of the image at a time. Then the group moves on to another row, calculated using the num_groups value. This will work well no matter how many groups you have. For example, if you have 7 compute units, group 3 (the forth group) will start on row 3 in the image, and then every 7th row thereafter. Group 3 would compute 153 rows in total, and its final row would be row 1074. Some work groups may compute 1 more row -- groups 0 and 1 in this example.
The same interlacing is done within the work group when looking at the columns of the image. in the colNum loop, the Nth work item starts at column N, and skips ahead by local_size columns. The remainder for this loop shouldn't come in to play as often, because CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE will likely be a factor of 1920. Try all work group sizes from (1..X) * CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE up to the maximum work group size for your device.
One final point about this kernel: the results are not identical to your original kernel. Your global_hist array is 1080 * 256 integers. The one I have needs to be num_groups * 256 integers. This helps if you want a full reduction, because there is much less to add after the kernel executes.
I have taken the Kernel from the great OpenCL SpMV article for AMD by Bryan Catanzaro.
I have given it a toy problem where the input is
A= [0 0 0 6 1 3 5 7 2 4 0 0]
offsets= [-3 0 2]
x= [1 2 3 4]
and the output y should be [7 22 15 34]
Here is the kernel:
__kernel
void dia_spmv(__global float *A, __const int rows,
__const int diags, __global int *offsets,
__global float *x, __global float *y) {
int row = get_global_id(0);
float accumulator = 0;
for(int diag = 0; diag < diags; diag++) {
int col = row + offsets[diag];
if ((col >= 0) && (col < rows)) {
float m = A[diag*rows + row];
float v = x[col];
accumulator += m * v;
}
}
y[row] = accumulator;
}
After loading and writing the input arguments I execute the kernel like this:
size_t global_work_size;
global_work_size = 4;
err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL, &global_work_size,NULL, 0, NULL, NULL);
err = clFinish(cmd_queue);
And I get the correct result when I read y back from gpu memory.
I.e. I get y = [7 22 15 34]
I am new to OpenCL (and GPGPU in general) so I want to try and understand how to extend the problem correctly for much larger matrices of arbitrary dimension.
So lets say I have 1000 000 rows. What should I set global_work_size to be?
And should I set local_work_size or should I leave it as NULL?
To use the kernel for arbitrary matrix sizes you should think about the problem and rewrite the kernel. The issue is the limited memory size of the GPU and limited size for a single buffer. You can get the maximum size for a buffer with clGetDeviceInfo and CL_DEVICE_MAX_MEM_ALLOC_SIZE.
You need to split your problem into smaller pieces. Calculate them separately and merge the results afterwards.
I do not know the problem above and can not give you any hint which helps you to implement this. I can only give you the general direction.