I optimize algorithms using OpenCL and I want to vectorise the kernel.
Is vloadn / vstoren slower than a simple cast to the needed vector type in the case of aligned data?
As far as I know, vloadn / vstoren do not provide faster access. They only provide helper functionality for accessing memory.
http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf
The spec is your friend.
Basically, the only difference between casting and vloadn is that vloadn allows you to perform unaligned loads (unaligned relative to the vector size; the primitive type must still be aligned). If you perform a simple cast to float4, the pointer must be aligned to a 4*sizeof(float) boundary. If you use vloadn for float4, the pointer only needs to be aligned to a sizeof(float) boundary.
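As a minimal sketch (the kernel and names here are illustrative, not from the spec), both of these read four consecutive floats, but only the cast imposes the 16-byte alignment requirement:

// Hypothetical kernel: two ways to load 4 consecutive floats.
__kernel void load_demo(__global const float* in, __global float4* out) {
    size_t i = get_global_id(0);
    // Cast: 'in' must be aligned to 4 * sizeof(float) = 16 bytes.
    float4 a = ((__global const float4*)in)[i];
    // vload4: 'in' only needs sizeof(float) alignment; the offset is in
    // units of float4, so this also reads in[4*i] .. in[4*i + 3].
    float4 b = vload4(i, in);
    out[i] = a + b;
}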
So one might expect the pure cast to be faster, as it does not have to account for unaligned addresses at all, and in no case whatsoever should the pure cast be slower, unless the implementation is buggy.
tl;dr: vloadn is probably slower than a direct cast in the case of aligned data.
I would like to know how one can declare half-precision buffers and pointers in SYCL, namely in the following ways:
Via the buffer class.
Using the malloc_device() function.
Also, suppose I have an existing fp32 matrix / array on the host side. How can I copy its contents to fp16 memory on the GPU side?
TIA
For half-precision, you can just use sycl::half as the template parameter for either of these.
For copying, you'll need to convert the data from fp32 to fp16, which you could do with a kernel that performs the conversion element by element:
accHalf[i] = static_cast<sycl::half>(accFloat[i]);
This seems to be a well documented problem with solutions; see this thread.
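As a rough sketch (SYCL 2020; the queue, size, and variable names are illustrative), both declarations plus a device-side fp32-to-fp16 conversion could look like this:

#include <sycl/sycl.hpp>
#include <vector>

int main() {
    sycl::queue q;
    constexpr size_t n = 1024;
    std::vector<float> hostF32(n, 1.0f);

    // 1) Half-precision storage via the buffer class (shown for the
    //    declaration; unused in the rest of this sketch).
    sycl::buffer<sycl::half, 1> bufF16{sycl::range<1>{n}};

    // 2) Half-precision storage via USM with malloc_device().
    sycl::half* devF16 = sycl::malloc_device<sycl::half>(n, q);

    // Copy the fp32 host data to the device, then convert in a kernel.
    float* devF32 = sycl::malloc_device<float>(n, q);
    q.memcpy(devF32, hostF32.data(), n * sizeof(float)).wait();
    q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
        devF16[i] = static_cast<sycl::half>(devF32[i]);
    }).wait();

    sycl::free(devF32, q);
    sycl::free(devF16, q);
    return 0;
}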
What are the downsides of using Bigarray when interfacing with C is not an issue? Are they slower, in particular for small 2D matrices?
Just based on looking through the implementations, I'd say that bigarrays might be slower if you create large numbers of short-lived arrays. It looks like the memory for them is managed outside the usual OCaml GC, which handles short-lived objects extremely well.
You also might find that accesses to bigarrays aren't inlined, whereas accesses to the built-in arrays would be.
On the other hand, built-in arrays are going to have an extra indirection for two dimensions.
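To make the layout difference concrete, here's a small sketch (the dimensions are illustrative): a 2-D Bigarray is one flat block, while a float array array goes through a row pointer first.

(* 2-D Bigarray: a single flat block of unboxed float64s. *)
let ba = Bigarray.Array2.create Bigarray.float64 Bigarray.c_layout 100 100

(* Built-in nested arrays: an array of pointers to row arrays. *)
let builtin = Array.make_matrix 100 100 0.0

let () =
  Bigarray.Array2.set ba 5 7 1.0;   (* one flat indexed access *)
  builtin.(5).(7) <- 1.0;           (* row pointer, then element *)
  Printf.printf "%f %f\n" (Bigarray.Array2.get ba 5 7) builtin.(5).(7)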
If the performance really matters, you'll probably have to benchmark your particular application.
The main downside is right there in the type: bigarrays can hold only a small subset of primitive types.
There are lots of real-world reasons you'd want to do this. Ours is that we have a list of variable-length data structures, and we want to be able to change the size of one of the elements without recopying them all.
Here are a few things I've tried:
1. Just have a lot of kernel arguments. Sure, sounds hacky, but it works for small N. This is actually what we've been doing.
2. Do 1) with some sort of macro loop which extends the kernel args to the max size (which I think is device dependent). I don't really want to do this... it sounds bad.
3. Create some sort of list of structs which contain pointers, and fill it before your kernel invocation. I tried this, and I think it violates the spec. According to what I've seen on the nVidia forums, preserving the address of a device pointer beyond one kernel invocation is illegal. If anyone can point to where in the spec it says this, I'd love to know, because I can't find it. However, this definitely breaks on ATI hardware, as it moves the objects around.
4. Give up, store the variable-sized objects in a big array, and write a clever algorithm to use empty space so the whole array must be reflowed less often. This will work, but it's an inelegant, complicated design. Also, it requires lots of scary pointer arithmetic...
Does anyone have other ideas, or experience trying to do this? Is there a least-hacky way, and why?
To 3:
OpenCL 1.1 spec page 193 says "Arguments to kernel functions in a program cannot be declared as a pointer to a pointer(s)."
A struct containing a pointer to a pointer (a pointer to a buffer object) might not be against a strict reading of this sentence, but it's within its spirit: no pointers to buffer objects may be passed as arguments from host code to a kernel, even if they're hidden inside a user-defined struct.
I'd opt for option 5: do not use variable-size data structures. If you have any way of making them constant size, by all means do it. It will make your life a whole lot easier. To be precise, there is no such thing as a 'variable size struct'. Every struct definition produces constant-sized structs, so if the size has changed then the struct itself has changed and therefore requires another mem object. Every pointer passed to a kernel function must have a single type.
In addition to sharpneli's answer (option 5):
If the objects have similar sizes, you could use a union sized for the biggest possible object. But make sure you use explicit alignment. Pass a second buffer identifying which union member is used for each object in your variable-sized-objects-in-static-size-union buffer, as in the sketch below.
I resorted to this when using OpenCL library code that only allowed one variable array of arbitrary type. I simply used cl_float2 to pass two floats. Since the cl_floatN types are implemented as unions, what works for the built-in types will work for you as well.
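As a sketch of that approach (the structs, tags, and sizes here are made up), every element occupies the size of the largest variant, and a parallel tag buffer says which member is live:

typedef struct { float x, y, r; }    Circle;
typedef struct { float x, y, w, h; } Rect;

/* Explicitly aligned union sized to the biggest variant. */
typedef union __attribute__((aligned(16))) {
    Circle circle;
    Rect   rect;
} Shape;

__kernel void area(__global const Shape* shapes,
                   __global const int*   tags,  /* 0 = circle, 1 = rect */
                   __global float*       out) {
    size_t i = get_global_id(0);
    if (tags[i] == 0)
        out[i] = M_PI_F * shapes[i].circle.r * shapes[i].circle.r;
    else
        out[i] = shapes[i].rect.w * shapes[i].rect.h;
}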
Hi, I'm writing a program that needs high-performance handling of vector elements:
vector<Class_A> object;
1. Which one is fastest for accessing the elements?
2. Which is simpler and less complex to deal with in code?
An index? An iterator? A pointer?
An iterator or pointer will have the same performance on most implementations -- usually a vector iterator is a pointer. An index needs to calculate a pointer each time, but the optimizer can sometimes take care of that. Generally though, as another commenter said, there's no sense in optimizing this for performance.
All of that said, I would probably go with an iterator since it's easier to change to another type of container if need be.
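For concreteness, here are the three styles side by side (Class_A and the loop body are placeholders):

#include <vector>

struct Class_A { int value; };

void touch_all(std::vector<Class_A>& object) {
    if (object.empty()) return; // &object[0] below needs a non-empty vector

    // 1) Index: recomputes the address from the base each time.
    for (std::size_t i = 0; i < object.size(); ++i)
        object[i].value += 1;

    // 2) Iterator: usually compiles to the same code as a pointer,
    //    and is the easiest to retarget at another container.
    for (std::vector<Class_A>::iterator it = object.begin();
         it != object.end(); ++it)
        it->value += 1;

    // 3) Raw pointer into the contiguous storage.
    for (Class_A* p = &object[0]; p != &object[0] + object.size(); ++p)
        p->value += 1;
}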
Assuming you have inlining enabled and aren't doing index range checking, they will probably all be about the same. Besides, micro-optimizing this probably isn't going to gain you anything; you need to optimize globally. Profile your process and target the slowest parts first.
I would like to know if it is more efficient to use an int ** than a QList<QList<int> >, or if they are pretty much equal. I have to do a lot of calculations, so I might want to go with the faster one.
The difference in speed depends on the operations you are doing. QList is safer because it automatically allocates and deallocates its storage.
Worry about your program being correct first, then worry about performance, and always profile first before you optimize.
Here is a chart with the algorithmic complexity of Qt containers depending on their use case:
http://qt.nokia.com/doc/4.6/containers.html#algorithmic-complexity
Maybe it will help you!
If I refer to the QList documentation:
Internally, QList is represented as an array of pointers to items of type T
Ref: http://qt.nokia.com/doc/4.6/qlist.html#details
So, it seems to be pretty equivalent. If you want to be sure, you can look at the source code or write a benchmark using QTestLib.
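If you do want numbers, a QTestLib micro-benchmark along these lines would settle it for your sizes (the 100x100 matrix and the summing loop are just placeholders):

// Save as containerbench.cpp; the trailing .moc include is the usual
// QTestLib pattern and assumes this file name.
#include <QtTest/QtTest>
#include <QList>

class ContainerBench : public QObject {
    Q_OBJECT
private slots:
    void qlistSum() {
        QList<QList<int> > m;
        for (int r = 0; r < 100; ++r) {
            QList<int> row;
            for (int c = 0; c < 100; ++c) row.append(c);
            m.append(row);
        }
        QBENCHMARK {
            long long sum = 0;
            for (int r = 0; r < 100; ++r)
                for (int c = 0; c < 100; ++c) sum += m[r][c];
            Q_UNUSED(sum);
        }
    }
    void rawSum() {
        int** m = new int*[100];
        for (int r = 0; r < 100; ++r) {
            m[r] = new int[100];
            for (int c = 0; c < 100; ++c) m[r][c] = c;
        }
        QBENCHMARK {
            long long sum = 0;
            for (int r = 0; r < 100; ++r)
                for (int c = 0; c < 100; ++c) sum += m[r][c];
            Q_UNUSED(sum);
        }
        for (int r = 0; r < 100; ++r) delete[] m[r];
        delete[] m;
    }
};

QTEST_MAIN(ContainerBench)
#include "containerbench.moc"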