Creating a context on any platform/device - opencl

I am currently writing some unit tests for OpenCL kernels and need to create a context.
As I am not after performance, it does not matter to me which device runs the kernel.
So I want to create the context with as few restrictions as possible, and thought of this code:
#define __CL_ENABLE_EXCEPTIONS
#include <CL/cl.hpp>
#include <iostream>

int main() {
    try {
        cl::Context context(CL_DEVICE_TYPE_ALL);
    } catch (const cl::Error &err) {
        std::cerr << "Caught cl::Error (" << err.what() << "): " << err.err() << "\n";
    }
}
Which returns
Caught cl::Error (clCreateContextFromType): -32
-32 is CL_INVALID_PLATFORM and the documentation of clCreateContextFromType says:
CL_INVALID_PLATFORM if properties is NULL and no platform could be selected or if platform value specified in properties is not a valid platform.
As I have not provided any properties, they are of course NULL.
Why can't any platform be selected?
I have also tried CL_DEVICE_TYPE_DEFAULT, with the same result.
Here is a list of my platform and device that were detected:
NVIDIA CUDA
(GPU) GeForce GTX 560
As a side note: specifying the platform in the properties works as intended.

I tried your code, using CL_DEVICE_TYPE_ALL and it worked on my setup. I'm not sure why it's not working on yours...
As a workaround, maybe you could do something like this:
// Get all available platforms
std::vector<cl::Platform> platforms;
cl::Platform::get(&platforms);
// Pick the first and get all available devices
std::vector<cl::Device> devices;
platforms[0].getDevices(CL_DEVICE_TYPE_ALL, &devices);
// Create a context on the first device, whatever it is
cl::Context context(devices[0]);

cl::Context context(cl_device_type) calls the complete constructor with default parameters:
cl::Context::Context(cl_device_type type,
                     cl_context_properties *properties = NULL,
                     void (CL_CALLBACK *notifyFptr)(const char *, const void *, ::size_t, void *) = NULL,
                     void *data = NULL,
                     cl_int *err = NULL)
Which is just a thin C++ wrapper around the underlying clCreateContextFromType().
That function allows a NULL pointer to be passed as the properties, but then the platform selection is implementation-defined, and it looks like in your case it does not default to the NVIDIA platform.
You will have to pass some info to the constructor, I'm afraid.
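For example, here is a minimal sketch that picks the first available platform explicitly and passes it in via the context properties (error handling omitted; this mirrors the "specifying the platform" approach the question mentions):
std::vector<cl::Platform> platforms;
cl::Platform::get(&platforms);

// Build a zero-terminated properties list naming the chosen platform,
// as clCreateContextFromType() expects.
cl_context_properties props[] = {
    CL_CONTEXT_PLATFORM,
    reinterpret_cast<cl_context_properties>(platforms[0]()),
    0
};
cl::Context context(CL_DEVICE_TYPE_ALL, props);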

Related

CL_OUT_OF_RESOURCES error is returned by clEnqueueNDRangeKernel() with dynamic parallelism

Kernel code that produces the error:
__kernel void testDynamic(__global int *data)
{
    int id = get_global_id(0);
    atomic_add(&data[1], 2);
}

__kernel void test(__global int *data)
{
    int id = get_global_id(0);
    atomic_add(&data[0], 2);
    if (id == 0) {
        queue_t q = get_default_queue();
        ndrange_t ndrange = ndrange_1D(1, 1);
        void (^my_block_A)(void) = ^{ testDynamic(data); };
        enqueue_kernel(q, CLK_ENQUEUE_FLAGS_WAIT_KERNEL,
                       ndrange,
                       my_block_A);
    }
}
I tested the code below to be sure the OpenCL 2.0 compiler is working:
__kernel void test2(__global int *data)
{
    int id = get_global_id(0);
    data[id] = work_group_scan_inclusive_add(id);
}
The scan function gives 0, 1, 3, 6 as output, so the OpenCL 2.0 work-group reduction functions are working.
Is dynamic parallelism an extension to OpenCL 2.0? If I remove the enqueue_kernel command, the results equal the expected values (omitting the child kernel).
Device: AMD RX 550, driver: 17.6.2
Is there a special command that needs to be run on the host side to run a child kernel on the get_default_queue queue? For now, the command queue is created the OpenCL 1.2 way, as below:
commandQueue = cl::CommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, &err);
Does get_default_queue() have to be the same command queue that launches the parent kernel? I'm asking because I use the same command queue to upload data to the GPU and then download the results, in a single synchronization.
Moved solution from question to answer:
Edit: the API call below was the solution:
commandQueue = cl::CommandQueue(context, device,
                                CL_QUEUE_ON_DEVICE |
                                CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE |
                                CL_QUEUE_ON_DEVICE_DEFAULT, &err);
After creating this queue (only one per device), I didn't use it for anything else; the parent kernel is enqueued on an ordinary host queue. So it looks like get_default_queue() does not have to be the queue that launches the parent kernel.
The documentation says CL_INVALID_QUEUE_PROPERTIES will be returned if CL_QUEUE_ON_DEVICE is specified, but on my machine dynamic parallelism works with the commandQueue constructor parameters above and does not produce that error.
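For reference, a minimal sketch of the resulting two-queue setup, assuming the OpenCL 2.x C++ headers and that context, device, and parentKernel already exist (these names are placeholders):
cl_int err;

// Ordinary host queue: used for uploads, downloads, and kernel launches.
cl::CommandQueue hostQueue(context, device, CL_QUEUE_PROFILING_ENABLE, &err);

// On-device default queue: created once per device and then left alone;
// get_default_queue() inside the kernel resolves to it.
cl::CommandQueue deviceQueue(context, device,
                             CL_QUEUE_ON_DEVICE |
                             CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE |
                             CL_QUEUE_ON_DEVICE_DEFAULT, &err);

// The parent kernel is enqueued on the host queue as usual.
hostQueue.enqueueNDRangeKernel(parentKernel, cl::NullRange,
                               cl::NDRange(64), cl::NDRange(64));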

Error compiling VexCL program: raw_ptr not a member of device_vector

I am trying to compile my first VexCL program using the thrust example and I get the following error message:
raw_ptr is not a member of 'vex::backend::opencl::device_vector'
Here is the code:
vex::Context ctx(vex::Filter::Env && vex::Filter::Count(1));
std::cout << ctx << std::endl;

vex::profiler<> prof(ctx);

typedef int T;
const size_t n = 16 * 1024 * 1024;
vex::vector<T> x(ctx, n);
vex::Random<T> rnd;

// Get raw pointers to the device memory.
T *x_begin = x(0).raw_ptr(); // Here is where the error occurs.
T *x_end = x_begin + x.size();
I do not understand the language well enough. I appreciate any help in this matter.
Thanks
Chris
The thrust example is not the best one to start with, as it deals with interfacing VexCL and Thrust (another high-level library, one that targets CUDA).
So in order to compile the example, you need to use the CUDA backend in VexCL. That is, you need to define the VEXCL_BACKEND_CUDA preprocessor macro
and link against libcuda.so (or cuda.lib on Windows) instead of libOpenCL.so/OpenCL.lib.
The error you got is because the device_vector class only exposes the raw_ptr() method on the CUDA backend.
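A minimal sketch of the backend selection; the macro can equally be passed on the compiler command line (e.g. -DVEXCL_BACKEND_CUDA):
// Select the CUDA backend; this must be defined before any VexCL header is included.
#define VEXCL_BACKEND_CUDA
#include <vexcl/vexcl.hpp>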

Qt ActiveX "Unknown error"

I'm working on integrating a T-Cube motor controller (http://www.thorlabs.de/newgrouppage9.cfm?objectgroup_id=2419) into software based on the Qt 4.8.1 package. Since there is no manual or any sort of tutorial on how to retrieve the ActiveX object and call its methods, I did the following.
1) Looked through the Windows registry for words similar to the motor controller name. Found a candidate with CLSID "{3CE35BF3-1E13-4D2C-8C0B-DEF6314420B3}".
2) Tried initializing it in the following way (the code provided is shortened; all result checks are removed to improve readability):
HRESULT h_result = CoInitializeEx(NULL, COINIT_MULTITHREADED);
pd->moto = new QAxObject();
initialized = pd->moto->setControl("{3CE35BF3-1E13-4D2C-8C0B-DEF6314420B3}");
QString stri = pd->moto->generateDocumentation();
pd->moto->dynamicCall("SetHWSerialNum(int)", params); // params holds the serial number (elided here)

QVariantList posParams;
posParams << 0;   // channel ID
posParams << 0.0; // out-parameter, receives the position
int result = pd->moto->dynamicCall("GetPosition(int, double&)", posParams).toInt();
value = posParams[1].toFloat();

QVariantList moveParams;
moveParams << 0;  // channel ID
moveParams << dist;
moveParams << dist;
moveParams << true;
result = pd->moto->dynamicCall("MoveRelativeEx(int, double, double, bool)", moveParams).toInt();
3) The generateDocumentation() method gives a perfect description of ~150 methods.
4) All dynamicCall() invocations cause "Error calling ...: Unknown error", where "..." is the first argument of dynamicCall(), taken from the list generateDocumentation() gave me.
5) If I pass dynamicCall() a method that is not present in the generated documentation, the output is different. So I suppose the methods in the generated documentation really do exist.
6) If I use the #import directive and try calling directly, avoiding QAxObject, I see the "mg17motor.tlh" file, but none of the interfaces described there contain any methods, so I can't use it directly either. Is this normal?
I would be very much obliged for any advice.
You can find the ActiveX object using the OLE viewer. Search for something like
APT.. or MG.. under all objects, then find the parameter ProgID=MGMOTOR.MGMotorCtrl.1.
Now in Qt, don't use QAxObject but QAxWidget. You then get something like:
#include <QAxWidget>
#include <QFile>
#include <QTextStream>
#include <QThread>

QAxWidget *aptMotor = new QAxWidget();
QVariant chanID = QVariant(0);
aptMotor->setControl("MGMOTOR.MGMotorCtrl.1");

// Nice HTML documentation on the available functions
QFile file("out.html");
file.open(QIODevice::WriteOnly | QIODevice::Text);
QTextStream out(&file);
out << aptMotor->generateDocumentation();
file.close();

aptMotor->setProperty("HWSerialNum", QVariant(83853493));
aptMotor->dynamicCall("StartCtrl");
aptMotor->dynamicCall("EnableHWChannel(QVariant)", chanID);
QThread::sleep(1); // Give it time to enable the channel

double pos(0);
aptMotor->dynamicCall("SetAbsMovePos(QVariant,QVariant)", chanID, QVariant(pos));
aptMotor->dynamicCall("MoveAbsolute(QVariant,QVariant,QVariant)", chanID, 0);
aptMotor->dynamicCall("StopCtrl");

Copying a struct containing pointers to CUDA device

I'm working on a project where I need my CUDA device to perform computations on a struct containing pointers.
typedef struct StructA {
    int *arr;
} StructA;
When I allocate memory for the struct and then copy it to the device, only the struct itself is copied, not the contents of the pointer. Right now I'm working around this by allocating the pointer first, then setting the host struct to use that new pointer (which resides on the GPU). The following code sample shows this approach using the struct from above:
#include <cstdlib>
#include <cuda_runtime.h>

#define N 10

__global__ void kernel(StructA *a); // kernel definition omitted here

int main() {
    int h_arr[N] = {1,2,3,4,5,6,7,8,9,10};
    StructA *h_a = (StructA*)malloc(sizeof(StructA));
    StructA *d_a;
    int *d_arr;
    // 1. Allocate device struct.
    cudaMalloc((void**)&d_a, sizeof(StructA));
    // 2. Allocate device pointer.
    cudaMalloc((void**)&d_arr, sizeof(int)*N);
    // 3. Copy pointer content from host to device.
    cudaMemcpy(d_arr, h_arr, sizeof(int)*N, cudaMemcpyHostToDevice);
    // 4. Point to device pointer in host struct.
    h_a->arr = d_arr;
    // 5. Copy struct from host to device.
    cudaMemcpy(d_a, h_a, sizeof(StructA), cudaMemcpyHostToDevice);
    // 6. Call kernel.
    kernel<<<N,1>>>(d_a);
    // 7. Copy struct from device to host.
    cudaMemcpy(h_a, d_a, sizeof(StructA), cudaMemcpyDeviceToHost);
    // 8. Copy pointer from device to host.
    cudaMemcpy(h_arr, d_arr, sizeof(int)*N, cudaMemcpyDeviceToHost);
    // 9. Point to host pointer in host struct.
    h_a->arr = h_arr;
}
My question is: is this the way to do it?
It seems like an awful lot of work, and I remind you that this is a very simple struct. If my struct contained a lot of pointers, or structs with pointers themselves, the code for allocation and copying would become quite extensive and confusing.
Edit: CUDA 6 introduces Unified Memory, which makes this "deep copy" problem a lot easier. See this post for more details.
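For illustration, a minimal sketch of the Unified Memory approach (assumes CUDA 6+ and a supported GPU; error checking omitted, kernel as in the question):
// One managed allocation is visible to both host and device,
// so no explicit deep copy is needed.
StructA *a;
cudaMallocManaged(&a, sizeof(StructA));
cudaMallocManaged(&a->arr, sizeof(int) * N);
for (int i = 0; i < N; ++i) a->arr[i] = i + 1; // fill on the host
kernel<<<N, 1>>>(a);                           // the same pointers work on the device
cudaDeviceSynchronize();                       // wait before touching a->arr on the host again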
Don't forget that you can pass structures by value to kernels. This code works:
// pass struct by value (may not be efficient for complex structures)
__global__ void kernel2(StructA in)
{
    in.arr[threadIdx.x] *= 2;
}
Doing so means you only have to copy the array to the device, not the structure:
int h_arr[N] = {1,2,3,4,5,6,7,8,9,10};
StructA h_a;
int *d_arr;
// 1. Allocate device array.
cudaMalloc((void**) &(d_arr), sizeof(int)*N);
// 2. Copy array contents from host to device.
cudaMemcpy(d_arr, h_arr, sizeof(int)*N, cudaMemcpyHostToDevice);
// 3. Point to device pointer in host struct.
h_a.arr = d_arr;
// 4. Call kernel with host struct as argument
kernel2<<<N,1>>>(h_a);
// 5. Copy pointer from device to host.
cudaMemcpy(h_arr, d_arr, sizeof(int)*N, cudaMemcpyDeviceToHost);
// 6. Point to host pointer in host struct
// (or do something else with it if this is not needed)
h_a.arr = h_arr;
As pointed out by Mark Harris, structures can be passed by value to CUDA kernels. However, some care should be devoted to setting up a proper destructor, since the destructor is called at exit from the kernel.
Consider the following example:
#include <stdio.h>
#include "Utilities.cuh"

#define NUMBLOCKS  512
#define NUMTHREADS 512 * 2

/***************/
/* TEST STRUCT */
/***************/
struct Lock {
    int *d_state;

    // --- Constructor
    Lock(void) {
        int h_state = 0;                                       // --- Host side lock state initializer
        gpuErrchk(cudaMalloc((void **)&d_state, sizeof(int))); // --- Allocate device side lock state
        gpuErrchk(cudaMemcpy(d_state, &h_state, sizeof(int), cudaMemcpyHostToDevice)); // --- Initialize device side lock state
    }

    // --- Destructor (wrong version)
    //~Lock(void) {
    //    printf("Calling destructor\n");
    //    gpuErrchk(cudaFree(d_state));
    //}

    // --- Destructor (correct version)
    //__host__ __device__ ~Lock(void) {
    //#if !defined(__CUDACC__)
    //    gpuErrchk(cudaFree(d_state));
    //#else
    //
    //#endif
    //}

    // --- Lock function
    __device__ void lock(void) { while (atomicCAS(d_state, 0, 1) != 0); }

    // --- Unlock function
    __device__ void unlock(void) { atomicExch(d_state, 0); }
};

/**********************************/
/* BLOCK COUNTER KERNEL WITH LOCK */
/**********************************/
__global__ void blockCounterLocked(Lock lock, int *nblocks) {
    if (threadIdx.x == 0) {
        lock.lock();
        *nblocks = *nblocks + 1;
        lock.unlock();
    }
}

/********/
/* MAIN */
/********/
int main(){
    int h_counting, *d_counting;
    Lock lock;

    gpuErrchk(cudaMalloc(&d_counting, sizeof(int)));

    // --- Locked case
    h_counting = 0;
    gpuErrchk(cudaMemcpy(d_counting, &h_counting, sizeof(int), cudaMemcpyHostToDevice));

    blockCounterLocked<<<NUMBLOCKS, NUMTHREADS>>>(lock, d_counting);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());

    gpuErrchk(cudaMemcpy(&h_counting, d_counting, sizeof(int), cudaMemcpyDeviceToHost));
    printf("Counting in the locked case: %i\n", h_counting);

    gpuErrchk(cudaFree(d_counting));
}
with the wrong version of the destructor uncommented (do not pay too much attention to what the code actually does). If you run that code, you will receive the following output:
Calling destructor
Counting in the locked case: 512
Calling destructor
GPUassert: invalid device pointer D:/Project/passStructToKernel/passClassToKernel/Utilities.cu 37
There are two calls to the destructor: one at kernel exit and one at exit from main. The error message arises because, if the memory locations pointed to by d_state are freed at kernel exit, they cannot be freed again at exit from main. Accordingly, the destructor must behave differently for host and device executions, which is what the commented-out correct version of the destructor in the above code accomplishes.
A struct of arrays is a nightmare in CUDA. You will have to copy each of the pointers over to a new struct that the device can use. Maybe you could use an array of structs instead? If not, the only way I have found is to attack it the way you do, which is in no way pretty.
EDIT:
Since I can't comment on the top answer: step 9 is redundant, since you can change steps 8 and 9 into:
// 8. Copy pointer from device to host.
cudaMemcpy(h_a->arr, d_arr, sizeof(int)*N, cudaMemcpyDeviceToHost);

Condition Variable in Shared Memory - is this code POSIX-conformant?

Does the POSIX standard allow a named shared memory block to contain a mutex and condition variable?
We've been trying to use a mutex and condition variable to synchronise access to named shared memory by two processes on a LynuxWorks LynxOS-SE system (POSIX-conformant).
One shared memory block is called "/sync" and contains the mutex and condition variable, the other is "/data" and contains the actual data we are syncing access to.
We're seeing failures from pthread_cond_signal() if both processes don't perform the mmap() calls in exactly the same order, or if one process mmaps in some other piece of shared memory before it mmaps the "/sync" memory.
This example code is about as short as I can make it:
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <sys/file.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <assert.h>
#include <pthread.h>
#include <errno.h>
#include <iostream>
#include <string>

using namespace std;

static const string shm_name_sync("/sync");
static const string shm_name_data("/data");

struct shared_memory_sync
{
    pthread_mutex_t mutex;
    pthread_cond_t condition;
};

struct shared_memory_data
{
    int a;
    int b;
};

// Create 2 shared memory objects
//  - sync contains 2 shared synchronisation objects (mutex and condition)
//  - data not important
void create()
{
    // Create and map 'sync' shared memory
    int fd_sync = shm_open(shm_name_sync.c_str(), O_CREAT|O_RDWR, S_IRUSR|S_IWUSR);
    ftruncate(fd_sync, sizeof(shared_memory_sync));
    void* addr_sync = mmap(0, sizeof(shared_memory_sync), PROT_READ|PROT_WRITE, MAP_SHARED, fd_sync, 0);
    shared_memory_sync* p_sync = static_cast<shared_memory_sync*>(addr_sync);

    // Init the cond and mutex
    pthread_condattr_t cond_attr;
    pthread_condattr_init(&cond_attr);
    pthread_condattr_setpshared(&cond_attr, PTHREAD_PROCESS_SHARED);
    pthread_cond_init(&(p_sync->condition), &cond_attr);
    pthread_condattr_destroy(&cond_attr);

    pthread_mutexattr_t m_attr;
    pthread_mutexattr_init(&m_attr);
    pthread_mutexattr_setpshared(&m_attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&(p_sync->mutex), &m_attr);
    pthread_mutexattr_destroy(&m_attr);

    // Create the 'data' shared memory
    int fd_data = shm_open(shm_name_data.c_str(), O_CREAT|O_RDWR, S_IRUSR|S_IWUSR);
    ftruncate(fd_data, sizeof(shared_memory_data));
    void* addr_data = mmap(0, sizeof(shared_memory_data), PROT_READ|PROT_WRITE, MAP_SHARED, fd_data, 0);
    shared_memory_data* p_data = static_cast<shared_memory_data*>(addr_data);

    // Run the second process while it sleeps here.
    sleep(10);

    int res = pthread_cond_signal(&(p_sync->condition));
    assert(res == 0); // <--- !!!THIS ASSERT WILL FAIL ON LYNXOS!!!

    munmap(addr_sync, sizeof(shared_memory_sync));
    shm_unlink(shm_name_sync.c_str());
    munmap(addr_data, sizeof(shared_memory_data));
    shm_unlink(shm_name_data.c_str());
}

// Open the same 2 shared memory objects but in reverse order
//  - data
//  - sync
void open()
{
    sleep(2);

    int fd_data = shm_open(shm_name_data.c_str(), O_RDWR, S_IRUSR|S_IWUSR);
    void* addr_data = mmap(0, sizeof(shared_memory_data), PROT_READ|PROT_WRITE, MAP_SHARED, fd_data, 0);
    shared_memory_data* p_data = static_cast<shared_memory_data*>(addr_data);

    int fd_sync = shm_open(shm_name_sync.c_str(), O_RDWR, S_IRUSR|S_IWUSR);
    void* addr_sync = mmap(0, sizeof(shared_memory_sync), PROT_READ|PROT_WRITE, MAP_SHARED, fd_sync, 0);
    shared_memory_sync* p_sync = static_cast<shared_memory_sync*>(addr_sync);

    // Wait on the condvar
    pthread_mutex_lock(&(p_sync->mutex));
    pthread_cond_wait(&(p_sync->condition), &(p_sync->mutex));
    pthread_mutex_unlock(&(p_sync->mutex));

    munmap(addr_sync, sizeof(shared_memory_sync));
    munmap(addr_data, sizeof(shared_memory_data));
}

int main(int argc, char** argv)
{
    if (argc > 1)
    {
        open();
    }
    else
    {
        create();
    }
    return 0;
}
Run this program with no arguments, then run another copy with an argument, and the first one will fail at the assert checking pthread_cond_signal().
But change the order in open() so that the "/sync" memory is mmap()ed before "/data", and it all works fine.
This seems like a major bug in LynxOS to me, but LynuxWorks claim that using mutex and condition variable within named shared memory in this way is not covered by the POSIX standard, so they are not interested.
Can anyone determine if this code does actually violate POSIX?
Or does anyone have any convincing documentation that it is POSIX compliant?
Edit: we know that PTHREAD_PROCESS_SHARED is POSIX and is supported by LynxOS. The point of contention is whether mutexes and condition variables can be used within named shared memory (as we have done), or whether POSIX only allows them to be used when one process creates and mmaps the shared memory and then forks the second process.
The pthread_mutexattr_setpshared function may be used to allow a pthread mutex in shared memory to be accessed by any thread which has access to that memory, even threads in different processes. According to this link, pthread_mutexattr_setpshared conforms to POSIX P1003.1c. (The same goes for the condition variable; see pthread_condattr_setpshared.)
Related question: pthread condition variables on Linux, odd behaviour
I can easily see how PTHREAD_PROCESS_SHARED can be tricky to implement at the OS level (e.g. macOS doesn't support it, except for rwlocks, it seems). But just from reading the standard, you seem to have a case.
For completeness, you might want to assert on sysconf(_SC_THREAD_PROCESS_SHARED) and on the return values of the *_setpshared() function calls; maybe there's another "surprise" waiting for you (but I can see from the comments that you already checked that SHARED is actually supported).
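A minimal sketch of those checks, reusing the attribute names from the question's create() function (this only verifies that the feature is advertised and accepted, not that it actually works):
// Is process-shared synchronisation advertised at all?
assert(sysconf(_SC_THREAD_PROCESS_SHARED) > 0);

// Do the attribute calls actually accept PTHREAD_PROCESS_SHARED?
pthread_condattr_t cond_attr;
pthread_condattr_init(&cond_attr);
assert(pthread_condattr_setpshared(&cond_attr, PTHREAD_PROCESS_SHARED) == 0);

pthread_mutexattr_t m_attr;
pthread_mutexattr_init(&m_attr);
assert(pthread_mutexattr_setpshared(&m_attr, PTHREAD_PROCESS_SHARED) == 0);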
@JesperE: you might want to refer to the API docs at the OpenGroup instead of the HP docs.
Maybe there are pointers inside pthread_cond_t (when pshared is not set), so the object must end up at the same address in both threads/processes. With same-ordered mmaps you may get equal addresses in both processes.
In glibc, the pointer inside cond_t pointed to the thread descriptor of the thread that owned the mutex/cond.
You can control the mapping addresses by passing a non-NULL first parameter to mmap.
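A minimal sketch of that address-hint idea, reusing the names from the question's code (the hint address below is purely hypothetical, and without MAP_FIXED the kernel is free to ignore it, so the result must be checked):
// Ask for the same base address in both processes.
void* hint = reinterpret_cast<void*>(0x60000000); // hypothetical, platform-dependent
void* addr_sync = mmap(hint, sizeof(shared_memory_sync),
                       PROT_READ|PROT_WRITE, MAP_SHARED, fd_sync, 0);
assert(addr_sync == hint); // the mapping may land elsewhere; verify before use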
