I need to store a per-iteration float sum in a vector type inside a for loop in OpenCL:
float16 temp;
for( int j = 0; j <16; j++)
{
float sum = row * column; // Row x column
temp = sum;
}
I need the output to look something like temp = (sum0, sum1, ...).
A regular float array is the way to go here. While you can address the components of a float16 individually with .s0, .s1, ..., .sf, you cannot automate this in a loop with a loop index (see the OpenCL 1.2 Reference Guide). The only way with float16 would be to write out every loop iteration by hand, which is not practical.
However, with float temp[16]; you can do exactly that. Your code would then look like this:
float temp[16];
for(int j=0; j<16; j++) {
float sum = row * column; // Row x column
temp[j] = sum;
}
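To convert temp into float16, you can list every component by hand, which again is tedious (note that the last valid index is 15, not 16):
float16 temp2 = (float16)(temp[0], temp[1], ..., temp[15]);
Alternatively, the built-in vload16(0, temp) loads all 16 values from the array in one call.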
I would advise you to get rid of float16 entirely in your particular application. Using regular float* arrays is more practical (automatic indexing in loops) and also faster when you access the data from global memory: You can use Structure-of-Arrays and are not bound to the much slower Array-of-Structures data layout.
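To make the layout remark concrete, here is a rough host-side sketch in C++ (not OpenCL C). The names ParticleAoS, ParticlesSoA, to_soa and sum_x are invented for this illustration and do not come from the question; the point is only that the Structure-of-Arrays form keeps each component in its own plain array that a loop index can walk.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array-of-Structures: x, y, z, w of one element are adjacent in memory.
struct ParticleAoS { float x, y, z, w; };

// Structure-of-Arrays: one plain float array per component.
struct ParticlesSoA {
    std::vector<float> x, y, z, w;
};

// Converts the AoS layout into the SoA layout.
ParticlesSoA to_soa(const std::vector<ParticleAoS>& aos) {
    ParticlesSoA soa;
    for (const ParticleAoS& p : aos) {
        soa.x.push_back(p.x);
        soa.y.push_back(p.y);
        soa.z.push_back(p.z);
        soa.w.push_back(p.w);
    }
    assert(soa.x.size() == aos.size());
    return soa;
}

// Sums one component with ordinary loop indexing over contiguous floats.
float sum_x(const ParticlesSoA& soa) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < soa.x.size(); ++i) sum += soa.x[i];
    return sum;
}
```

In an OpenCL kernel the same idea would mean passing x, y, z and w as separate __global float* arguments instead of one buffer of vector types.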
Related
I am trying to copy a 2 dimensional array to another 2 dimensional array. Since the name (srcAry) is the address of the first element of the source array, I have been able to print out all the values in the source array using pointer arithmetic in a for loop. I am using the number of rows times the number of columns as the condition to stop looping. If I try to assign the values to the new array using this method, I get an error message (error: assignment to expression with array type). Is it possible to do this, or am I limited to using two nested for loops with indexes?
...
void copyAry(double *pAry, int numRows, int numCols)
{
double newAry[numRows][numCols];
int end = numRows * numCols;
int ctr = 0;
for( ; ctr < end; ctr++)
// printf("*(pAry + %d) = %.1f\n", ctr, *(pAry + ctr)); //this works fine
{
*(newAry + ctr) = *(pAry + ctr); //this is where I receive error
}
return;
}
...
Thanks in advance.
I would assume that the type of newAry + ctr is not double* as your code assumes, but rather double (*)[numCols], i.e. a pointer to an array of numCols elements. That also means each increment advances not by one element but by numCols, and dereferencing it yields an array, which cannot be assigned to.
Usually you would use memcpy for this kind of low-level data copying. Barring that, you might start with double* pNewAry = &newAry[0][0] or some such in order to treat the 2D array as a linear sequence of doubles.
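A minimal sketch of the memcpy route, written in C++ and assuming the source is stored row-major and contiguous. std::vector stands in for the question's variable-length array (which standard C++ does not have), and the helper at() is a name made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Copies a row-major numRows x numCols block starting at pAry into a new
// contiguous buffer with a single memcpy.
std::vector<double> copyAry(const double* pAry, std::size_t numRows, std::size_t numCols) {
    std::vector<double> newAry(numRows * numCols);
    if (!newAry.empty()) {
        assert(pAry != nullptr);
        std::memcpy(newAry.data(), pAry, newAry.size() * sizeof(double));
    }
    return newAry;
}

// Element (r, c) of the copy lives at linear index r * numCols + c.
double at(const std::vector<double>& a, std::size_t numCols, std::size_t r, std::size_t c) {
    return a[r * numCols + c];
}
```

For a source declared as double srcAry[2][3], the call would be copyAry(&srcAry[0][0], 2, 3).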
I'm updating a single element in a buffer from two lanes and need an atomic for float4 types. (More specifically, I launch twice as many threads as there are buffer elements, and each successive pair of threads updates the same element.)
For instance (this pseudocode does nothing useful, but hopefully illustrates my issue):
int idx = get_global_id(0);
int mapIdx = floor (idx / 2.0);
float4 toAdd;
// ...
if (idx % 2)
{
toAdd = (float4)(0,1,0,1);
}
else
{
toAdd = (float4)(1,0,1,0);
}
// avoid race condition here?
// I'd like to atomic_add(map[mapIdx],toAdd);
map[mapIdx] += toAdd;
In this example, map[0] should be incremented by (1,1,1,1). (0,1,0,1) from thread 0, and (1,0,1,0) from thread 1.
Suggestions? I haven't found any reference to vector atomics in the CL documents. I suppose I could do this on each individual vector component separately:
atomic_add(map[mapIdx].x, toAdd.x);
atomic_add(map[mapIdx].y, toAdd.y);
atomic_add(map[mapIdx].z, toAdd.z);
atomic_add(map[mapIdx].w, toAdd.w);
... but that just feels like a bad idea. (And it requires a cmpxchg hack, since there are no float atomics.)
Suggestions?
Alternatively, you could try using local memory like this:
__local float4 local_map[LOCAL_SIZE/2];
if(idx < LOCAL_SIZE/2) // Splitting by halves rather than by idx % 2 keeps the work-items of a warp/wavefront on the same branch; alternating may hurt performance
local_map[mapIdx] = toAdd;
barrier(CLK_LOCAL_MEM_FENCE);
if(idx >= LOCAL_SIZE/2)
local_map[mapIdx - LOCAL_SIZE/2] += toAdd;
barrier(CLK_LOCAL_MEM_FENCE);
Which approach is faster, atomics or local memory, and whether local memory is feasible at all (the required size may be too large), depends on the actual kernel, so you will need to benchmark both and pick the better one.
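For comparison, the compare-exchange trick mentioned in the question can be sketched on the host side in C++. This is only an illustration of the technique, not kernel code: the function names are invented here, and std::atomic stands in for the 32-bit integer atomic_cmpxchg that OpenCL C provides. A float4 version would apply it once per component, as the question suggests.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits wide");

// Reinterpret the bits of a float as a 32-bit integer and back.
std::uint32_t float_bits(float f) { std::uint32_t u; std::memcpy(&u, &f, sizeof u); return u; }
float bits_float(std::uint32_t u) { float f; std::memcpy(&f, &u, sizeof f); return f; }

// Adds `value` to the float whose bits are stored in `target` and returns the
// previous value. Retries until no other thread changed the cell in between.
float atomic_add_float(std::atomic<std::uint32_t>& target, float value) {
    assert(target.is_lock_free());  // the trick relies on a native 32-bit atomic
    std::uint32_t expected = target.load();
    std::uint32_t desired;
    do {
        desired = float_bits(bits_float(expected) + value);
        // On failure compare_exchange_weak refreshes `expected`, so the sum is recomputed.
    } while (!target.compare_exchange_weak(expected, desired));
    return bits_float(expected);
}
```

Under contention every failed exchange costs another round trip, which is one reason the local-memory variant may win; only a benchmark on the real kernel can tell.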
Update:
To answer your question from the comments: to write the result back to the global buffer later, do:
if(idx < LOCAL_SIZE/2)
map[mapIdx] = local_map[mapIdx];
Or you can try without introducing local buffer and write directly into global buffer:
if(idx < LOCAL_SIZE/2)
map[mapIdx] = toAdd;
barrier(CLK_GLOBAL_MEM_FENCE); // <- notice that now we use barrier related to global memory
if(idx >= LOCAL_SIZE/2)
map[mapIdx - LOCAL_SIZE/2] += toAdd;
barrier(CLK_GLOBAL_MEM_FENCE);
Aside from that, I can now see a problem with the indexes. To use the code from my answer, the earlier code should look like this:
if(idx < LOCAL_SIZE/2)
{
toAdd = (float4)(0,1,0,1);
}
else
{
toAdd = (float4)(1,0,1,0);
}
If you need to use idx % 2, though, then all of the code must follow that scheme, or you will have to do some index arithmetic so that the values end up in the right places in map.
If I understand the issue correctly, I would do the following.
Get rid of the ifs by making an array with the offsets:
__constant float4 offsets[2] = {(float4)(1,0,1,0), (float4)(0,1,0,1)};
and use idx % 2 as the index into it.
Then move map into local memory and use barrier(CLK_LOCAL_MEM_FENCE) to make sure all threads in the group are synced.
I am having trouble passing values between host code and kernel code because of some vector data types. The following code and explanation only illustrate my problem; my real code is much bigger and more complicated. With this small example I hope to explain where the problem is. If anything more is needed, please let me know.
std::vector<vector<double>> output;
for (int i = 0;i<2; i++)
{
auto& out = output[i];
sum =0;
for (int l =0;l<3;l++)
{
for (int j=0;j<4; j++)
{
if (some condition is true)
{ out[j+l] = 0.;}
sum+= .....some addition...
}
out[j+l] = sum
}
}
Now I want to parallelize this code, from the second loop. This is what I have done in host code:
cl::buffer out = (context,CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, output.size(), &output, NULL)
Then, I have set the arguments
cl::SetKernelArg(0, out);
Then the loop,
for (int i = 0; i < 2; i++)
{
auto& out = output[i];
// sending some more arguments (which change on every iteration) for the sum operations
queue.enqueueNDRangeKernel(.......)
queue.enqueuereadbuffer(.....,&out,...)
}
In Kernel Code:
__kernel void sumout(__global double* out, ....)
{
int l = get_global_id(0);
int j = get_global_id(1);
if (some condition is true)
{ out[j+l] = 0.; // Here it goes out of the loop then
return}
sum+= .....some addition...
}
out[j+l] = sum
}
So now, inside the if condition, out[j+l] is set to 0 in the loop, which means the value of out changes regularly. In the normal code, out is a reference to one of the inner vectors. I am not able to read the values of output through out between my kernel and host code. I want to read the values in output[i] for every out[j+l], but I am confused by the combination of buffer and vector.
Just to clarify further: output is a vector of vectors, and out is a reference to one inner vector of output. I need to update the values in output for every change in out. Since these are vectors, I passed out as a cl buffer. I hope that is clear.
Please let me know, if the code is required, I will try to provide as much as I can.
You are sending pointers to vectors to OpenCL. The elements of the outer vector are contiguous at the pointer level, but the data as a whole is not contiguous in memory, since each inner vector points to a different memory area. OpenCL cannot follow those nested host pointers when it copies a buffer, and the API has no command that does so.
You could use a vector of fixed-size arrays (std::array, available since C++11) or plain arrays.
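One way to get contiguous data without changing the rest of the host code is to pack the nested vectors into a flat block before creating the buffer and to unpack after reading back. The sketch below is plain C++ with no OpenCL calls; flatten and unflatten are names made up for this example, and it assumes all inner vectors have the same length.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Packs equally sized rows into one contiguous, row-major block. The result can
// be handed to a buffer as flat.data() with flat.size() * sizeof(double) bytes.
std::vector<double> flatten(const std::vector<std::vector<double>>& rows) {
    const std::size_t cols = rows.empty() ? 0 : rows[0].size();
    std::vector<double> flat;
    flat.reserve(rows.size() * cols);
    for (const auto& row : rows) {
        assert(row.size() == cols);  // assumption of this sketch: no ragged rows
        flat.insert(flat.end(), row.begin(), row.end());
    }
    return flat;
}

// Copies results read back from the device into the nested vectors again.
void unflatten(const std::vector<double>& flat, std::vector<std::vector<double>>& rows) {
    std::size_t offset = 0;
    for (auto& row : rows) {
        assert(offset + row.size() <= flat.size());
        for (double& v : row) v = flat[offset++];
    }
}
```

Inside the kernel, element j of row i is then flat[i * cols + j]. Note that the buffer size is given in bytes, not in elements.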
I've got a QVector of QVector. And I want to collect all elements in all QVectors to form a new QVector.
Currently I use the code like this
QVector<QVector<T> > vectors;
// ...
QVector<T> collected;
for (int i = 0; i < vectors.size(); ++i) {
collected += vectors[i];
}
But it seems that operator+= actually appends each element to the QVector one by one. So is there a more time-efficient way to use QVector, or a more suitable type to replace QVector?
If you really need to, then I would do something like:
QVector< QVector<T> > vectors = QVector< QVector<T> >();
int totalSize = 0;
for (int i = 0; i < vectors.size(); ++i)
totalSize += vectors.at(i).size();
QVector<T> collected;
collected.reserve(totalSize);
for (int i = 0; i < vectors.size(); ++i)
collected << vectors[i];
But please take note that this sounds a bit like premature optimisation. As the documentation points out:
QVector tries to reduce the number of reallocations by preallocating up to twice as much memory as the actual data needs.
So don't do this kind of thing unless you're really sure it will improve your performance. Keep it simple (like your current way of doing it).
Edit in response to your additional requirement of O(1):
Well, if you're inserting at random positions, a linked list fits, but if you're just appending (which is all you've mentioned), you already get amortized O(1) with QVector. Take a look at the documentation for Qt containers.
for (int i = 0; i < vectors.size(); ++i) {
for(int k=0;k<vectors[i].size();k++){
collected.push_back(vectors[i][k]);
}
}
outer loop: take out each vector from vectors
inner loop: take out each element in the i'th vector and push into collected
You could use Boost Multi-Array, this provides a multi-dimensional array.
It is also a 'header only' library, so you don't need to separately compile a library, just drop the headers into a folder in your project and include them.
See the link for the tutorial and example.
I have two questions regarding the multidimensional arrays. I declared a 3D array using two stars but when I try to access the elements I get a used-without-initializing error.
unsigned **(test[10]);
**(test[0]) = 5;
How come I get that error, while with the following code I don't? What's the difference?
unsigned test3[10][10][10];
**(test3[0]) = 5;
My second question is this: I'm trying to port a piece of code that was written for Unix to Windows. One of the lines is this:
unsigned **(precomputedHashesOfULSHs[nnStruct->nHFTuples]);
nHFTuples is of type int, but it's not a constant, and this is the error that I'm getting:
error C2057: expected constant expression
Is it possible that I'm getting this error because I'm running it on Windows not Unix? - and how would I solve this problem? I can't make nHFTuples a constant because the user will need to provide the value for it!
In the first one, you didn't declare a 3D array, you declared an array of 10 pointers to pointers to unsigned ints. When you dereference it, you're dereferencing a garbage pointer.
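In the second one, you declared the array correctly, so the storage exists. The dereferences only compile because an array decays to a pointer to its first element; arrays are not pointers, though, and plain indexing says what you mean.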
Do this:
unsigned test3[10][10][10];
test3[0][0][0] = 5;
To answer your second question: in C89 the size of an array has to be a constant known at compile time. Variable-length arrays were only introduced in C99; GCC accepts them, but Visual C++ does not, so the declaration is not portable. To fix it, you'll have to use malloc and free:
int i, j;
/* firstDimensionLength is nnStruct->nHFTuples; secondDimensionLength and
   thirdDimensionLength are placeholders for the sizes your data needs */
unsigned ***precomputedHashOfULSHs = malloc(firstDimensionLength * sizeof(unsigned **));
for (i = 0; i < firstDimensionLength; ++i) {
precomputedHashOfULSHs[i] = malloc(secondDimensionLength * sizeof(unsigned *));
for (j = 0; j < secondDimensionLength; ++j)
precomputedHashOfULSHs[i][j] = malloc(thirdDimensionLength * sizeof(unsigned));
}
// then when you're done, free in the reverse order...
for (i = 0; i < firstDimensionLength; ++i) {
for (j = 0; j < secondDimensionLength; ++j)
free(precomputedHashOfULSHs[i][j]);
free(precomputedHashOfULSHs[i]);
}
free(precomputedHashOfULSHs);
(Pardon me if that allocation/deallocation code is wrong, it's late :))
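If the pointer-to-pointer structure is not required by the rest of the program, a single contiguous allocation with index arithmetic avoids the nested loops altogether. The following is a sketch in C++ under that assumption; Grid3D and its members are hypothetical names, not something from the ported code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A 3D grid of unsigned values whose sizes are chosen at run time, stored in
// one contiguous block that is released automatically.
class Grid3D {
public:
    Grid3D(std::size_t d0, std::size_t d1, std::size_t d2)
        : d0_(d0), d1_(d1), d2_(d2), data_(d0 * d1 * d2, 0u) {}

    // Row-major addressing: the last index varies fastest.
    unsigned& at(std::size_t i, std::size_t j, std::size_t k) {
        assert(i < d0_ && j < d1_ && k < d2_);
        return data_[(i * d1_ + j) * d2_ + k];
    }

    std::size_t size() const { return data_.size(); }

private:
    std::size_t d0_, d1_, d2_;
    std::vector<unsigned> data_;
};
```

There is one allocation and no manual free, at the cost of writing g.at(i, j, k) instead of g[i][j][k].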
Although you don't specify it, I think you're using a compiler on Unix that supports C99 (such as GCC), whereas the compiler you use on Windows does not support it. (Visual Studio uses only C89 here.)
You have three options:
1. You can hard-code a suitable maximum array size.
2. You could allocate the array yourself using malloc or calloc. Don't forget to free it when you're done.
3. Port the program to C++, and use std::vector.
If you choose option 3, then you'll want something like:
std::vector<unsigned int> precomputedHashOfULSHs;
That gives a single-dimension vector; for a two-dimensional vector, use:
std::vector<std::vector<unsigned int> > precomputedHashOfULSHs;
Do note that vectors default to being empty, of zero length, so you will need to add each element from the original set.
In the case of test3 as an example, you'll want:
std::vector<std::vector<std::vector<unsigned int> > > precomputedHashOfULSHs;
precomputedHashOfULSHs.resize(10);
for(int i = 0; i < 10; i++) {
precomputedHashOfULSHs[i].resize(10);
for(int ii=0; ii<10; ii++) {
precomputedHashOfULSHs[i][ii].resize(10);
}
}
I haven't tested this code, but it should be right. C++ will manage the memory of that vector for you automatically.
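As a possible shortcut, the nested resize loops can be replaced by the fill constructor of std::vector, which copies a prototype inner vector the requested number of times. The alias Vec3D and the function make3d are names invented for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec3D = std::vector<std::vector<std::vector<unsigned int>>>;

// Builds an n0 x n1 x n2 nested vector with every element initialized to 0.
Vec3D make3d(std::size_t n0, std::size_t n1, std::size_t n2) {
    Vec3D v(n0, std::vector<std::vector<unsigned int>>(n1, std::vector<unsigned int>(n2)));
    assert(v.size() == n0);
    return v;
}
```

With this, the test3 example becomes Vec3D precomputedHashOfULSHs = make3d(10, 10, 10); followed by precomputedHashOfULSHs[0][0][0] = 5;.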