I have code that contains know-how I would not like to distribute in source form. One possible solution is to ship a set of pre-compiled kernels and choose the correct binary depending on the user's hardware.
How can I cover most users (AMD and Intel; Nvidia users can run the CUDA version) with a minimum of binaries and a minimum of machines on which I have to run my offline compiler? Are there families of GPUs that can use the same binaries? The CUDA compiler can target multiple architectures; what about OpenCL? Binary compatibility doesn't seem to be well documented, but maybe someone has collected this data for themselves.
I know there's SPIR, but older hardware doesn't support it.
Here are the details of my implementation, in case someone finds this question and has done less than I have. I made a tool that compiles a kernel to a file, and then I collected all these binaries into a C array to be included in the main application:
const char* binaries[] = {
//kernels/HD Graphics 4000
"\x62\x70\x6c\x69\x73\x74\x30\x30\xd4\x01\x02\x03"
"\x04\x05\x06\x07\x08\x5f\x10\x0f\x63\x6c\x42\x69"
"\x6e\x61\x72\x79\x56\x65\x72\x73\x69\x6f\x6e\x5c"
...
"\x00\x00\x00\x00\x00\x00\x00\x09\x00\x00\x00\x00"
"\x00\x00\x00\x00\x00\x00\x00\x00\x00\x06\x47\xe0"
,
//here more kernels
};
size_t binaries_sizes[] = {
204998,
205907,
...
};
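For reference, this is roughly how the offline tool extracts such a blob after a successful build (a sketch using clGetProgramInfo; error checking omitted):

// Query the binary size for the (single) target device, then fetch the bytes.
size_t binSize = 0;
clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(binSize), &binSize, nullptr);
std::vector<unsigned char> bin(binSize);
unsigned char* ptr = bin.data();
clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(ptr), &ptr, nullptr);
// write bin to a file, then hex-dump it into the C array above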
Then I use the following code, which iterates over all the kernels, choosing the first one that builds successfully (I didn't invent anything more clever than trial and error; there is probably a better solution):
// lenof(x) is assumed to expand to sizeof(x) / sizeof((x)[0])
cl_int e3 = -1;
size_t i = 0;
while (e3 != CL_SUCCESS) {
    if (i == lenof(binaries)) {
        throw Error(); // none of the shipped binaries works on this device
    }
    program = clCreateProgramWithBinary(context, 1, &deviceIds[devIdx], &binaries_sizes[i],
                                        (const unsigned char**)&binaries[i],
                                        nullptr, &e3);
    if (e3 != CL_SUCCESS) {
        ++i;
        continue;
    }
    cl_int e4 = clBuildProgram(program, 1, &deviceIds[devIdx],
                               "", nullptr, nullptr);
    if (e4 != CL_SUCCESS)
        clReleaseProgram(program); // avoid leaking the program object on a failed build
    e3 = e4;
    ++i;
}
Unfortunately, there is no standard solution to your problem. OpenCL is platform-independent, and there is no standard way (apart from SPIR) to deal with this. Each vendor uses a different compiler toolchain internally, and even that can change across versions of the same driver, or between devices.
You could add some metadata to each kernel identifying which platform you compiled it for, which would save you the trial-and-error part (i.e., instead of just storing binaries and binaries_sizes, you also store binary_platform and binary_device, and then iterate through those arrays to see which binary you should load).
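A minimal sketch of that lookup, assuming a hypothetical binary_device array kept in sync with binaries and matching on CL_DEVICE_NAME:

const char* binary_device[] = { "HD Graphics 4000", /* one entry per blob */ };
char devName[256] = {0};
clGetDeviceInfo(deviceIds[devIdx], CL_DEVICE_NAME, sizeof(devName), devName, nullptr);
for (size_t i = 0; i < lenof(binaries); ++i) {
    if (strcmp(devName, binary_device[i]) == 0) { // pick the binary built for this device
        cl_int err;
        program = clCreateProgramWithBinary(context, 1, &deviceIds[devIdx],
                                            &binaries_sizes[i],
                                            (const unsigned char**)&binaries[i],
                                            nullptr, &err);
        break;
    }
}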
The best solution for you would be SPIR (or the newer SPIR-V): intermediate representations that can then be "re-compiled" by the OpenCL driver into the actual architecture's instruction set.
If you store your binaries in SPIR-V and have access to/knowledge of some compiler magic, you can use a translator tool to get back the LLVM IR and then compile it down to other targets, such as AMD or PTX, using the LLVM infrastructure (see https://github.com/KhronosGroup/SPIRV-LLVM).
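On newer stacks, loading a SPIR-V module is short (a sketch; clCreateProgramWithIL requires OpenCL 2.1+ or the cl_khr_il_program extension):

// spirv_blob / spirv_size hold a SPIR-V module produced offline
cl_int err;
cl_program prog = clCreateProgramWithIL(context, spirv_blob, spirv_size, &err);
if (err == CL_SUCCESS)
    err = clBuildProgram(prog, 1, &deviceIds[devIdx], "", nullptr, nullptr);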
Related
I'm trying to get device-side enqueuing working in my OpenCL application, but no matter what I do, get_default_queue() returns 0.
I've set up a second command queue with the on-device, on-device-default, and out-of-order execution mode properties enabled
I've checked that the device's enqueue capabilities report support for device-side enqueuing
I've checked that querying the main command queue for the device default returns the device command queue
I'm compiling with the CL3.0 flag
The kernel functions perfectly fine in every way except for this.
My code can be found here: https://github.com/labiraus/svo-tracer/blob/main/SvoTracer/SvoTracer.Kernel/EnqueueTest.cs
A stripped-down version of the kernel:
kernel void test(global int *out) {
int i = get_global_id(0);
if (get_default_queue() != 0) {
out[i] = 1;
} else {
out[i] = 2;
}
}
In OpenCL 3.0 this became an optional feature. While some devices will correctly report that device-side enqueuing is not supported, it seems that the card I was testing on reported support while not actually providing it.
Ultimately, device-side enqueuing isn't functionality that can be relied upon to be implemented by OpenCL 3.0-compatible devices.
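If it helps others, a host-side sanity check before relying on the feature would look like this (a sketch; device is a hypothetical cl_device_id, OpenCL 3.0 headers assumed), though as noted above the reported capability itself may be unreliable on some drivers:

cl_device_device_enqueue_capabilities caps = 0;
clGetDeviceInfo(device, CL_DEVICE_DEVICE_ENQUEUE_CAPABILITIES, sizeof(caps), &caps, NULL);
if (!(caps & CL_DEVICE_QUEUE_SUPPORTED)) {
    // device-side enqueue is optional in 3.0: fall back to host-side enqueuing
}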
Good day, everyone!
Not so long ago I managed to parallelize a recursive algorithm that searches for possible ways of combining certain events. At the moment the code looks like this:
//#include's
// function declarations
// declaration of a global variable:
QVector<QVector<QVector<float>>> variant; // (or std::vector)
int main() {
    // read data from a file
    // convert and analyze the data
    // fill the variant variable with the current best result (here - by pre-analysis)
    #pragma omp parallel shared(variant)
    #pragma omp master
    // call the recursive search over all variants:
    PEREBOR(Tabl_1, a, i_a, ..., rec_depth);
    return 0;
}
void PEREBOR(QVector<QVector<uint8_t>> Tabl_1, QVector<A_struct> a, uint8_t i_a, ..., uint8_t rec_depth)
{
    // determine the bounds of the first loop for certain reasons
    for (int i = quantity; i < another_quantity; i++) {
        // Tabl_1 is processed and modified to determine the number of steps of the inner loop
        for (int k = 0; k < the_quantity_just_found; k++) {
            if (rec_depth != 1) { // not at the lowest level yet - descend further
                // add the descent to the next recursion level to the task queue:
                #pragma omp task
                PEREBOR(Tabl_1_COPY, a, i_a, ..., rec_depth - 1);
            }
            else { // we have descended to the lowest level
                if (condition fulfilled) // condition check - READS the variant variable
                    variant = it_is_equal_to_that_,_to_that...;
                else
                    continue;
            }
        }
    }
}
At the moment this really works well: on six cores the CPU gives a speedup of more than 5.7x over the single-core version.
As you can see, with a sufficiently large number of threads a failure may occur due to simultaneous reading/writing of the variant variable. I understand that it needs to be protected. At the moment I see a way out only in using lock functions, since a critical section is not suitable: the variant variable is written in only one section of the code (at the lowest level of recursion), but it is read in many places.
So, here is the question: if I apply these constructs:
omp_lock_t lock;
int main() {
...
omp_init_lock(&lock);
#pragma omp parallel shared(variant, lock)
...
}
...
else { // we have descended to the lowest level
    if (condition fulfilled) { // condition check - READS the variant variable
        omp_set_lock(&lock);
        variant = it_is_equal_to_that_,_to_that...;
        omp_unset_lock(&lock);
    }
    else
        continue;
...
will this lock protect the reading of the variable in all other places? Or will I need to manually check the lock status and pause the thread before reading elsewhere?
I will be incredibly grateful to the distinguished community for help!
In the OpenMP specification (section 1.4.1, The structure of the OpenMP memory model) you can read:
The OpenMP API provides a relaxed-consistency, shared-memory model.
All OpenMP threads have access to a place to store and to retrieve
variables, called the memory. In addition, each thread is allowed to
have its own temporary view of the memory. The temporary view of
memory for each thread is not a required part of the OpenMP memory
model, but can represent any kind of intervening structure, such as
machine registers, cache, or other local storage, between the thread
and the memory. The temporary view of memory allows the thread to
cache variables and thereby to avoid going to memory for every
reference to a variable.
This practically means that (as with any relaxed memory model) threads are guaranteed to have the same, consistent view of the values of shared variables only at well-defined points. Between such points the temporary views may differ across threads.
In your code you have handled the problem of simultaneous writes to the same variable, but there is no guarantee that another thread reads the correct value of the variable without additional measures.
You have 3 options (note that each of these solutions not only handles simultaneous reads/writes, but also provides a consistent view of the values of shared variables):
If your variable is of scalar type, the best solution is to use atomic operations. This is the fastest option, as atomic operations are typically supported by the hardware.
#pragma omp parallel
{
...
#pragma omp atomic read
tmp=variant;
....
#pragma omp atomic write
variant=new_value;
}
Use the critical construct. This solution can be used if your variable is of a complex type (such as a class) whose reads/writes cannot be performed atomically. Note that it is much less efficient (slower) than an atomic operation.
#pragma omp parallel
{
...
#pragma omp critical
tmp=variant;
....
#pragma omp critical
variant=new_value;
}
Use locks for each read/write of your variable. Your code is OK for writes, but you have to use the lock for reads as well. This requires the most coding, but in practice the result is the same as using the critical construct. Note that OpenMP implementations typically use locks to implement critical constructs anyway.
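Applied to the question's code, the read side must take the same lock as the write (a minimal sketch; local_variant is a hypothetical local snapshot):

omp_set_lock(&lock);
QVector<QVector<QVector<float>>> local_variant = variant; // consistent snapshot
omp_unset_lock(&lock);
// evaluate the condition against local_variant instead of variant

Working on the snapshot keeps the lock hold time short, at the cost of one copy per read.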
clBuildProgram allows one to give a list of devices to build the program for; that is the reason for the num_devices and device_list parameters in the declaration:
cl_int clBuildProgram(cl_program program,
                      cl_uint num_devices,
                      const cl_device_id *device_list,
                      const char *options,
                      void (CL_CALLBACK *pfn_notify)(cl_program program, void *user_data),
                      void *user_data)
Now what happens if we use it like this?
clBuildProgram(program, 0, NULL, ...
1. Does it build for all devices in the PC?
2. Does it build only for the devices the context was created with? (I mean the context I used when I created the program with clCreateProgramWithSource.)
The documentation says:
device_list: A pointer to a list of devices associated with program. If device_list is NULL value, the program executable is built for all devices associated with program for which a source or binary has been loaded. If device_list is a non-NULL value, the program executable is built for devices specified in this list for which a source or binary has been loaded.
I think the phrasing is a bit complicated here, but from that, I guess number 2. Is that right?
I am asking because in case of number 1, I would need to pass a device list to this function in order to avoid superfluous compilation for all devices.
Number 2) is correct. Compilation is constrained to the devices associated with the program's context. That cannot be every single device in the system unless the context was created using every single device.
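For completeness, constraining the build to a single device explicitly would look like this (a sketch; devices[0] stands in for whichever device of the context you actually target):

cl_device_id target = devices[0]; // build only for this device of the context
cl_int err = clBuildProgram(program, 1, &target, "", NULL, NULL);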
Using AMD's APP OpenCL implementation with JOCL bindings, I'm trying to create a generic bracketing profiler using Java automatic resource management (try-with-resources). The basic idea is:
class Timer implements AutoCloseable {
...
Timer() {
...
clEnqueueMarker( commandQueue, startEvent );
}
public void close() {
cl_event stopEvent = new cl_event();
clEnqueueMarker( commandQueue, stopEvent );
clFinish( commandQueue );
... calculate and output times ...
}
}
My problem is that profiling information is not available for the marker command events (startEvent and stopEvent). This is despite (a) setting CL_QUEUE_PROFILING_ENABLE on the command queue, and (b) flushing and waiting on the command queue and verifying that the start and stop events are CL_COMPLETE with no errors.
So my question is, is profiling supported on marker commands in AMD OpenCL? If not, is it explicitly disallowed by the spec (I found nothing to this effect)?
Thanks.
I've rechecked the spec, and it seems to me that what you get is normal (though I've never paid much attention to that detail before). In section 5.12 about profiling, the standard states:
This section describes profiling of OpenCL functions that are enqueued
as commands to a command-queue. The specific functions being
referred to are: clEnqueue{Read|Write|Map}Buffer,
clEnqueue{Read|Write}BufferRect, clEnqueue{Read|Write|Map}Image,
clEnqueueUnmapMemObject, clEnqueueCopyBuffer, clEnqueueCopyBufferRect,
clEnqueueCopyImage, clEnqueueCopyImageToBuffer,
clEnqueueCopyBufferToImage, clEnqueueNDRangeKernel, clEnqueueTask and
clEnqueueNativeKernel.
So the clEnqueueMarker() function is not in the list, and I guess the CL_PROFILING_INFO_NOT_AVAILABLE value returned makes sense.
I just tried this, and it seems to work now. Tested on Windows 10 with an AMD 7870 and on Linux with Nvidia's Titan Black and Titan X cards.
The OpenCL 1.2 specs still contain the paragraph #CaptainObvious quoted, and the clEnqueueMarker function is still missing from the list, but I can get profiling information without a problem.
The start and end times on marker events are always equal, which makes a lot of sense.
By the way, clEnqueueMarker is deprecated as of OpenCL 1.2 and should be replaced with clEnqueueMarkerWithWaitList.
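For reference, the timestamps come from clGetEventProfilingInfo (a sketch; queue is assumed to have been created with CL_QUEUE_PROFILING_ENABLE):

cl_event marker;
clEnqueueMarkerWithWaitList(queue, 0, NULL, &marker);
clFinish(queue);
cl_ulong t_end = 0;
clGetEventProfilingInfo(marker, CL_PROFILING_COMMAND_END, sizeof(t_end), &t_end, NULL);
// subtract the corresponding timestamp of a start marker to get the bracketed duration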
I've got existing C# code with a signature of Foo(long[][] longs) which I need to call from unmanaged C++ (not C++/CLI). I just can't seem to figure out the right combination of __gc[] and __gc* to make the compiler happy.
With C++/CLI, this is straightforward:
std::vector<__int64> evens;
evens.push_back(2); evens.push_back(4); evens.push_back(6); evens.push_back(8);
std::vector<__int64> odds;
odds.push_back(1); odds.push_back(3); odds.push_back(5); odds.push_back(7);
std::vector<std::vector<__int64> > ints;
ints.push_back(evens); ints.push_back(odds);
array<array<__int64>^>^ mgdInts = gcnew array<array<__int64>^>(ints.size());
for (size_t i = 0; i<ints.size(); ++i)
{
const std::vector<__int64>& slice = ints[i];
mgdInts[i] = gcnew array<__int64>(slice.size());
for (size_t j=0; j<slice.size(); ++j)
mgdInts[i][j] = slice[j];
}
Edit: As I'm using Visual Studio 2008, the "simple" solution is to put the C++/CLI code in its own file and compile it with /clr; of course, it would be easier if I didn't have to do this (e.g., because other .h files use Managed C++). The C# code can't change, as it's auto-generated from a web reference.
Change the signature from this
Foo(long[][] longs)
to this:
Foo(Array longs)
Then when you look at the resulting type library in OleView.exe, you should see:
HRESULT Foo([in] SAFEARRAY(int64) longs);
From C++, that's fairly straightforward to call. You can use plain Win32 to create a SAFEARRAY, or, as I'd suggest, include <atlsafe.h> and use the CComSafeArray wrapper class from ATL.
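A minimal sketch of the ATL side (pFoo is a hypothetical interface pointer to the registered C# class; a one-dimensional SAFEARRAY of VT_I8 shown for brevity):

#include <atlsafe.h>

CComSafeArray<LONGLONG> sa(4); // 4 elements, marshalled as SAFEARRAY(int64)
for (LONG i = 0; i < 4; ++i)
    sa.SetAt(i, 2LL * (i + 1)); // 2, 4, 6, 8
// HRESULT hr = pFoo->Foo(sa); // CComSafeArray converts to SAFEARRAY* implicitly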
Even though both C# and C++ have richer array definitions, interoperability between the two is typically done through the Type Library marshaller, 99% of which is legacy and based on what's "VB compatible". The only array type the Type Library marshaller supports is SAFEARRAY, so that's what you get when you follow the normal way of doing all this.
However, COM also supports a richer array system (conformant arrays), which C# understands, but it's harder to do, and you can't simply regasm your C# DLL and use the resulting type library in your unmanaged C++ program. Some of the techniques require tweaking the C# assembly with ILDASM. Others require you to keep two definitions of the interface, one in C++ and one in C#, and make sure they stay in sync (there is no way to convert one to the other); then, in the IDL on the C++ side, adorn the parameter with size_is, and on the C# side with MarshalAs. It's kind of a mess, and really the only people who do that are those with an already-published legacy interface they cannot change. If this is your C# code and you can define the interface, I wouldn't go there. Still, the technique is available. Here's a reference: http://support.microsoft.com/kb/305990
Expect about a week or so to get through this if you've never done anything like it before.
The solution I came up with is to use List<>.ToArray():
System::Collections::Generic::List<__int64 __gc[]>* mgdInts = new System::Collections::Generic::List<__int64 __gc[]>(ints.size());
for (size_t i = 0; i<ints.size(); ++i)
{
const std::vector<__int64>& slice = ints[i];
__int64 mgdSlice __gc[] = new __int64 __gc[slice.size()];
for (size_t j=0; j<slice.size(); ++j)
mgdSlice[j] = slice[j];
mgdInts->Add(mgdSlice);
}
ClassLibrary1::Class1::Foo(mgdInts->ToArray());