OpenCL device uniqueness - opencl

Is there a way to get OpenCL to give me a list of all unique physical devices which have an OpenCL implementation available? I know how to iterate through the platform/device list but for instance, in my case, I have one Intel-provided platform which gives me an efficient device implementation for my CPU, and the APP platform which provides a fast implementation for my GPU but a terrible implementation for my CPU.
Is there a way to work out that the two CPU devices are in fact the same physical device, so that I can choose the most efficient one and work with that, instead of using both and having them contend with each other for compute time on the single physical device?
I have looked at CL_DEVICE_VENDOR_ID and CL_DEVICE_NAME, but they don't solve my problem: CL_DEVICE_NAME is the same for two separate physical devices of the same model (dual GPUs), and CL_DEVICE_VENDOR_ID gives me a different ID for my CPU depending on the platform.
An ideal solution would be some sort of unique physical device ID, but I'd be happy with manually altering the OpenCL configuration to rearrange the devices myself (if such a thing is possible).

As far as I have been able to investigate, there is no reliable solution. If all your work is done within a single process, you can use the order of entries returned by clGetDeviceIDs, or the cl_device_id values themselves (essentially they are pointers), but things get worse if you try to share those identifiers between processes.
See this blog post about it, which says:
The issue is that if you have two identical GPUs, you can’t distinguish between them. If you call clGetDeviceIDs, the order in which they are returned is actually unspecified, so if the first process picks the first device and the second takes the second device, they both may wind up oversubscribing the same GPU and leaving the other one idle.
However, he notes that NVIDIA and AMD provide their own extensions, cl_amd_device_topology and cl_nv_device_attribute_query. You can check whether these extensions are supported by your device, and then use them as follows (code by the original author):
// This cl_ext is provided as part of the AMD APP SDK
#include <CL/cl_ext.h>
cl_device_topology_amd topology;
status = clGetDeviceInfo(devices[i], CL_DEVICE_TOPOLOGY_AMD,
                         sizeof(cl_device_topology_amd), &topology, NULL);
if (status != CL_SUCCESS) {
    // Handle error
}
if (topology.raw.type == CL_DEVICE_TOPOLOGY_TYPE_PCIE_AMD) {
    std::cout << "INFO: Topology: " << "PCI[ B#" << (int)topology.pcie.bus
              << ", D#" << (int)topology.pcie.device << ", F#"
              << (int)topology.pcie.function << " ]" << std::endl;
}
or (code by me, adapted from the above linked post):
#define CL_DEVICE_PCI_BUS_ID_NV 0x4008
#define CL_DEVICE_PCI_SLOT_ID_NV 0x4009
cl_int bus_id;
cl_int slot_id;
status = clGetDeviceInfo(devices[i], CL_DEVICE_PCI_BUS_ID_NV,
                         sizeof(cl_int), &bus_id, NULL);
if (status != CL_SUCCESS) {
    // Handle error.
}
// Query the slot ID (not the bus ID a second time).
status = clGetDeviceInfo(devices[i], CL_DEVICE_PCI_SLOT_ID_NV,
                         sizeof(cl_int), &slot_id, NULL);
if (status != CL_SUCCESS) {
    // Handle error.
}
std::cout << "Topology = [" << bus_id <<
             ":" << slot_id << "]" << std::endl;

If you have two devices of the exact same kind belonging to a platform, you can tell them apart by the associated cl_device_ids returned by clGetDeviceIDs.
If you have devices that can be used by two different platforms, you can eliminate the entries for the second platform by comparing the device names from CL_DEVICE_NAME.
If you want to find the intended platform for a device, compare the CL_PLATFORM_VENDOR and CL_DEVICE_VENDOR strings from clGetPlatformInfo() and clGetDeviceInfo() respectively.
You can read in all platforms and all their associated devices into separate platform-specific lists and then eliminate doubles by comparing the device names in the separate lists. This should ensure that you do not get the same device for different platforms.
Finally you can, by command line arguments or configuration file for example, give arguments to your application to associate devices of a certain type (CPU, GPU, Accelerator) with a specific platform if there exists a choice of different platforms for a device type. Hopefully this will answer your question.

Anyway, let's just assume that you are trying to get the unique ID for all devices. You can simply query with clGetDeviceIDs:
cl_int clGetDeviceIDs(cl_platform_id platform,
                      cl_device_type device_type,
                      cl_uint num_entries,
                      cl_device_id *devices,
                      cl_uint *num_devices)
Your list of devices will then be written to the devices array, and you can call clGetDeviceInfo() on each entry to find out which device you'd like to use.

Combining answers above, my solution was:
long bus = 0; // leave it 0 for Intel
// update bus for NVIDIA/AMD ...
// ...
long uid = (bus << 5) | device_type;
The variable bus was computed using the NVIDIA/AMD device-specific info queries, as firegurafiku mentioned; the variable device_type was the result of the clGetDeviceInfo(clDevice, CL_DEVICE_TYPE, sizeof(cl_device_type), &device_type, nullptr) API call, as Steinin suggested.
This approach solved the issue of an Intel CPU and its integrated GPU getting the same ID. Now both devices have unique identifiers, thanks to their different CL_DEVICE_TYPEs.
Surprisingly, when running the code on an Oclgrind-emulated device, the Oclgrind simulator device also gets a unique identifier, 15, distinct from any other on my system.
The only case where the proposed approach can fail is several CPUs of the same model on a single mainboard.
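For reference, the packing step above can be sketched in plain C. This is only an illustration: the enum values mirror the CL_DEVICE_TYPE_* bitfield from the OpenCL headers and are redefined locally so the snippet compiles without an SDK, and make_device_uid is a hypothetical helper name, not part of any API.

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of the CL_DEVICE_TYPE_* bit values, redefined here
   only so the sketch builds without the OpenCL headers. */
enum {
    DEV_TYPE_DEFAULT     = 1 << 0,
    DEV_TYPE_CPU         = 1 << 1,
    DEV_TYPE_GPU         = 1 << 2,
    DEV_TYPE_ACCELERATOR = 1 << 3,
    DEV_TYPE_CUSTOM      = 1 << 4
};

/* Hypothetical helper: place the PCI bus number above the five
   device-type bits, as in "(bus << 5) | device_type". */
static uint64_t make_device_uid(uint32_t pci_bus, uint32_t device_type)
{
    assert(device_type < (1u << 5));  /* type must fit below the bus field */
    return ((uint64_t)pci_bus << 5) | device_type;
}
```

With bus left at 0, a CPU maps to 2 and a GPU to 4. A device that reports the default, CPU, GPU and accelerator bits together would map to 15, which would be consistent with the Oclgrind value mentioned above.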

Related

Why does clEnqueueMapBuffer not map to the original pointer? OpenCL, Caffe

Assume that a CPU pointer (cpu_ptr_) already exists, and I then create a buffer for the GPU (cl_gpu_mem_). The problem is that when I map the GPU buffer to a CPU pointer (mapped_ptr), mapped_ptr is not equal to the original pointer (cpu_ptr_), which causes CHECK_EQ(mapped_ptr, cpu_ptr_) to raise an error.
cl_gpu_mem_ = clCreateBuffer(ctx.handle().get(),
                             CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                             size_, cpu_ptr_, &err);
void *mapped_ptr = clEnqueueMapBuffer(
    ctx.get_queue().handle().get(),
    cl_gpu_mem_,
    true,
    CL_MAP_READ | CL_MAP_WRITE,
    0, size_, 0, NULL, NULL, NULL);
CHECK_EQ(mapped_ptr, cpu_ptr_)
    << "Device claims it supports zero copy"
    << " but failed to create correct user ptr buffer";
I don't know why this error occurs at all. Would you please give me any advice for this problem, or any solution to it. Thank you very much.
OpenCL implementations are free to mirror the host pointer (making it non-zero-copy). On devices that support true zero copy (e.g. Intel GPUs), there are still typically constraints that determine whether the implementation can really use that host allocation directly or must mirror it. On Intel, the host address must be page aligned and the length must be a multiple of 128 bytes (an even cacheline). (I typically just page align both.) I am not sure what AMD's and others' requirements are.
Look into aligned_alloc, or over-allocate by a couple of extra pages and use a page boundary inside the allocation as the base.
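A minimal sketch of the "page align both" approach, using C11 aligned_alloc. The 4096-byte page size is an assumption (query the real value on your system), and alloc_page_aligned and round_up are hypothetical helper names. Whether a given OpenCL implementation then maps the buffer at the same address is still up to that implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SIZE_ASSUMED 4096u  /* assumption: 4 KiB pages */

/* Round n up to the next multiple of align (a power of two). */
static size_t round_up(size_t n, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}

/* Allocate a host buffer whose base address and length are both
   multiples of the page size. The padded length is written to
   *padded_size; release the buffer with free(). */
static void *alloc_page_aligned(size_t size, size_t *padded_size)
{
    *padded_size = round_up(size ? size : 1, PAGE_SIZE_ASSUMED);
    return aligned_alloc(PAGE_SIZE_ASSUMED, *padded_size);
}
```

The padded length, not the original one, would then be the size passed to clCreateBuffer together with CL_MEM_USE_HOST_PTR.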

MFRC522 and specific sector/block reading

I am creating a game using the Mifare tags embedded in 8 different playing pieces. I will be using an Arduino NANO with the MFRC522 (library https://github.com/miguelbalboa/rfid) to do the actual reading of the tags, and am using an ER301 reader/writer (with eReader software) to assign playing piece numbers to them. I will be creating multiples of each piece to head off any issues I would have with loss due to breakage or theft (due to these being rather unique playing pieces). Since there will be 8 different pieces, and 4 copies of each piece, that would be 32 UIDs to keep up with. I would rather assign a different number to each of pieces, and the same number of each piece to its duplicates - so only 8 numbers to keep up with.
My question is - how do I read a certain block and sector with the MFRC522?
Specifically, sector 2, block 8 - because this is where the Hex equivalent of the playing piece number shows up (when it is assigned as a Product Name with the eReader software and the ER301 writer). I understand using the library for the MFRC522 to read the UID, but this is a bit more in-depth than my understanding.
I have written several Sketches for the Arduino, but this is my foray into the world of RFID, and is quite a bit more extensive than my previous Arduino projects. Once I can read the specific sector & block, the Arduino NANO will output a binary representation (on 4 of the digital I/Os) of which playing piece was placed.
The library you are using provides dedicated methods to perform read and write operations on MIFARE tags:
StatusCode MIFARE_Read(byte blockAddr, byte *buffer, byte *bufferSize);
StatusCode MIFARE_Write(byte blockAddr, byte *buffer, byte bufferSize);
Since your description (sector 2, block 8) suggests that you are using MIFARE Classic tags, you would also need to authenticate to the tag in order to perform read/write operations. Thus, you would also need the authentication method:
StatusCode PCD_Authenticate(byte command, byte blockAddr, MIFARE_Key *key, Uid *uid);
Just as you would use the library to read the UID
if (mfrc522.PICC_ReadCardSerial()) {
    Serial.print(F("Card UID:"));
    dump_bytes(mfrc522.uid.uidByte, mfrc522.uid.size);
}
you could also access these read/write methods:
MFRC522::StatusCode status;
MFRC522::MIFARE_Key key;
byte buffer[18];
byte size = sizeof(buffer);
for (byte i = 0; i < MFRC522::MF_KEY_SIZE; ++i) {
    key.keyByte[i] = 0xFF;
}
if (mfrc522.PICC_ReadCardSerial()) {
    status = mfrc522.PCD_Authenticate(MFRC522::PICC_CMD_MF_AUTH_KEY_A, 8, &key, &(mfrc522.uid));
    if (status == MFRC522::STATUS_OK) {
        status = mfrc522.MIFARE_Read(8, buffer, &size);
        if (status == MFRC522::STATUS_OK) {
            Serial.print(F("Data (block = 8): "));
            dump_bytes(buffer, 16);
        }
    }
}
Note that I assume block 8 (= sector 2, block 0) to be readable using key A and that key A is set to the default transport key FF FF FF FF FF FF. If your other reader changed those values, you need to adapt the code accordingly. Moreover, I used the pseudo-method dump_bytes(array, length) to indicate that the interesting value is the first length bytes of array. An implementation that actually prints those values is up to you.
By the way, a full example of how to use the library for read/write operations actually ships together with the library, so you could just take a look at ReadAndWrite.ino to see how to use it.
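The block arithmetic and the dump_bytes placeholder can be sketched in plain C as follows. This assumes the MIFARE Classic 1K layout (sectors numbered from 0, four 16-byte blocks per sector, the last block of each sector being the trailer). All function names here are hypothetical and not part of the MFRC522 library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define BLOCKS_PER_SECTOR 4u  /* assumption: MIFARE Classic 1K layout */

/* Absolute block address from (sector, block within sector). */
static unsigned block_address(unsigned sector, unsigned block_in_sector)
{
    assert(block_in_sector < BLOCKS_PER_SECTOR);
    return sector * BLOCKS_PER_SECTOR + block_in_sector;
}

/* Sector that an absolute block address belongs to. */
static unsigned sector_of_block(unsigned block_addr)
{
    return block_addr / BLOCKS_PER_SECTOR;
}

/* The last block of each sector holds the keys and access bits. */
static int is_sector_trailer(unsigned block_addr)
{
    return block_addr % BLOCKS_PER_SECTOR == BLOCKS_PER_SECTOR - 1;
}

/* Format the first len bytes of data as "AA BB CC" into out.
   Returns the number of characters written (without the NUL),
   or -1 if out is too small. */
static int format_hex(const unsigned char *data, size_t len,
                      char *out, size_t out_size)
{
    size_t pos = 0;
    if (out_size == 0)
        return -1;
    out[0] = '\0';
    for (size_t i = 0; i < len; ++i) {
        int n = snprintf(out + pos, out_size - pos,
                         i ? " %02X" : "%02X", (unsigned)data[i]);
        if (n < 0 || (size_t)n >= out_size - pos)
            return -1;
        pos += (size_t)n;
    }
    return (int)pos;
}
```

Under these assumptions block_address(2, 0) gives 8, matching the address used in the answer. On the Arduino itself a dump_bytes implementation would more likely loop over the bytes and call Serial.print(value, HEX).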

OpenBSD serial I/O: -lpthread makes read() block forever, even with termios VTIME set?

I have an FTDI USB serial device which I use via the termios serial API. I set up the port so that it will time-out on read() calls in half a second (by using the VTIME parameter), and this works on Linux as well as on FreeBSD. On OpenBSD 5.1, however, the read() call simply blocks forever when no data is available (see below.) I would expect read() to return 0 after 500ms.
Can anyone think of a reason that the termios API would behave differently under OpenBSD, at least with respect to the timeout feature?
EDIT: The no-timeout problem is caused by linking against pthread. Regardless of whether I'm actually using any pthreads, mutexes, etc., simply linking against that library causes read() to block forever instead of timing out based on the VTIME setting. Again, this problem only manifests on OpenBSD -- Linux and FreeBSD work as expected.
if ((sd = open(devPath, O_RDWR | O_NOCTTY)) >= 0)
{
    struct termios newtio;
    char input;
    memset(&newtio, 0, sizeof(newtio));
    // set options, including non-canonical mode
    newtio.c_cflag = (CREAD | CS8 | CLOCAL);
    newtio.c_lflag = 0;
    // when waiting for responses, wait until we haven't received
    // any characters for 0.5 seconds before timing out
    newtio.c_cc[VTIME] = 5;
    newtio.c_cc[VMIN] = 0;
    // set the input and output baud rates to 7812
    cfsetispeed(&newtio, 7812);
    cfsetospeed(&newtio, 7812);
    if ((tcflush(sd, TCIFLUSH) == 0) &&
        (tcsetattr(sd, TCSANOW, &newtio) == 0))
    {
        read(sd, &input, 1); // even though VTIME is set on the device,
                             // this read() will block forever when no
                             // character is available in the Rx buffer
    }
}
From the termios manpage:
Another dependency is whether the O_NONBLOCK flag is set by open() or
fcntl(). If the O_NONBLOCK flag is clear, then the read request is
blocked until data is available or a signal has been received. If the
O_NONBLOCK flag is set, then the read request is completed, without
blocking, in one of three ways:
1. If there is enough data available to satisfy the entire
request, and the read completes successfully the number of
bytes read is returned.
2. If there is not enough data available to satisfy the entire
request, and the read completes successfully, having read as
much data as possible, the number of bytes read is returned.
3. If there is no data available, the read returns -1, with errno
set to EAGAIN.
Can you check whether this is the case?
Cheers.
Edit: the OP traced the problem back to linking with pthreads, which caused read() to block. Upgrading to OpenBSD >5.2 resolved the issue, because the new rthreads implementation became the default threading library on OpenBSD. More info is in guenther@'s EuroBSDCon 2012 slides.
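Independent of VTIME, a read timeout can also be enforced in user space by waiting on the descriptor with poll() before calling read(). The sketch below shows that general POSIX technique. read_with_timeout is a hypothetical helper, and nothing in this thread establishes whether it would have avoided the blocking seen with the old OpenBSD 5.1 thread library.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Wait up to timeout_ms for fd to become readable, then read up to
   len bytes. Returns the number of bytes read, 0 on timeout (or end
   of file), and -1 on error. */
static ssize_t read_with_timeout(int fd, void *buf, size_t len, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int ready;

    assert(timeout_ms >= 0);  /* this helper never waits indefinitely */
    do {
        /* Note: restarting after a signal also restarts the timeout. */
        ready = poll(&pfd, 1, timeout_ms);
    } while (ready < 0 && errno == EINTR);

    if (ready < 0)
        return -1;
    if (ready == 0)
        return 0;  /* timed out with nothing to read */
    return read(fd, buf, len);
}
```

For the serial port in the question, the call would look like read_with_timeout(sd, &input, 1, 500).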

enqueueWriteBuffer locking up when sending non-32 bit aligned data

I am working on an OpenCL project and I have run into an issue where, if I try to send data from the CPU to global memory, it sometimes locks up the application. This happens sporadically: I can run it x times in a row and the next time it locks. It only appears to happen if I try to send data that is not 32 bits wide. For example, I can send float and int just fine, but when I try short, char, or half I get random lockups. It is not dying because of badly initialized data or something similar, because it does run, just not all the time. I also put some logging in and found that it always locks up just after attempting to write one of these non-standard-size data arrays. I am running on an NVIDIA GeForce GT 330M. Below is a snippet of the code I am running to send the data. I am using the C++ interface on the host side.
cl_half *m_aryTest;
shared_ptr< cl::Buffer > m_bufTest;
m_aryTest = new cl_half[m_iNeuronCount];
m_bufTest = shared_ptr<cl::Buffer>( new cl::Buffer(m_lpNervousSystem->ActiveContext(), CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, sizeof(m_aryTest)*m_iNeuronCount, m_aryTest));
kernel.setArg(8, *(m_bufTest.get()));
printf("m_bufTest.\n");
m_lpQueue->enqueueWriteBuffer(*(m_bufTest.get()), CL_TRUE, 0, sizeof(m_aryTest)*m_iNeuronCount, m_aryTest, NULL, NULL);
Does anyone have any ideas on why this is happening?
Thanks
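One detail in the snippet above that may be worth checking: m_aryTest is a pointer, so sizeof(m_aryTest) is the size of a pointer (typically 4 or 8 bytes), not the size of one cl_half element (2 bytes). The byte count passed to cl::Buffer and enqueueWriteBuffer is therefore larger than the array that was allocated. The sketch below only illustrates the size difference; half_bits is a stand-in for cl_half, and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t half_bits;  /* stand-in for cl_half, a 16-bit type */
static_assert(sizeof(half_bits) == 2, "half values are 16 bits wide");

/* Byte size of count elements, computed from the element type. */
static size_t element_based_bytes(size_t count)
{
    return sizeof(half_bits) * count;
}

/* What an expression of the form sizeof(pointer) * count computes:
   the pointer size, not the element size, times the count. */
static size_t pointer_based_bytes(const half_bits *ary, size_t count)
{
    return sizeof(ary) * count;
}
```

On a 32-bit build the pointer size happens to equal sizeof(float) and sizeof(int), which might be why those types appear to work. That is a guess rather than something confirmed in the thread. Using sizeof(cl_half) * m_iNeuronCount (or sizeof(*m_aryTest) * m_iNeuronCount) would make the size match the allocation.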

Simple algorithm for reliable communications

So, I have worked on large systems in the past, like an iso stack session layer, and something like that is too big for what I need, but I do have some understanding of the big picture. What I have now is a serial point to point communications link, where some component is dropping data (often).
So I am going to have to write my own reliable delivery system, using the link for transport. Can someone point me in the right direction for basic algorithms, or even give a clue as to what they are called? I tried Google, but ended up with postgraduate theories on genetic algorithms and such. I need the basics, e.g. 10-20 lines of pure C.
XMODEM. It's old, it's bad, but it is widely supported both in hardware and in software, with libraries available for literally every language and market niche.
HDLC - High-Level Data Link Control. It's the protocol that has fathered many reliable protocols over the last three decades, including TCP/IP. You can't use it directly, but it is a template for developing your own protocol. The basic premise is:
every data byte (or packet) is numbered
both sides of the communication maintain two numbers locally: last received and last sent
every packet contains a copy of the two numbers
every successful transmission is confirmed by sending back an empty (or not) packet with the updated numbers
if a transmission is not confirmed within some timeout, send again.
For special handling (synchronization), add flags to the packet (often a single bit is sufficient to mark the packet as special). And do not forget the CRC.
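The premise above can be sketched in C as a small frame format carrying both numbers plus a CRC. Everything here is illustrative rather than HDLC itself. The byte layout, the 32-byte payload limit, the choice of CRC-16/CCITT-FALSE and all function names are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_PAYLOAD 32  /* arbitrary limit chosen for this sketch */

/* CRC-16/CCITT-FALSE: polynomial 0x1021, initial value 0xFFFF. */
static uint16_t crc16(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFF;
    for (size_t i = 0; i < len; ++i) {
        crc ^= (uint16_t)(data[i] << 8);
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}

/* Wire layout: seq, ack, len, payload[len], crc_hi, crc_lo.
   Returns the number of bytes written to out. */
static size_t frame_encode(uint8_t seq, uint8_t ack, const uint8_t *payload,
                           uint8_t len, uint8_t *out)
{
    assert(len <= MAX_PAYLOAD);
    out[0] = seq;   /* number of this packet ("last sent") */
    out[1] = ack;   /* number being confirmed ("last received") */
    out[2] = len;   /* 0 means an empty, acknowledgement-only packet */
    if (len)
        memcpy(out + 3, payload, len);
    uint16_t crc = crc16(out, 3u + len);
    out[3 + len] = (uint8_t)(crc >> 8);
    out[4 + len] = (uint8_t)(crc & 0xFF);
    return 5u + len;
}

/* Returns 1 and fills the outputs if the frame is intact, 0 otherwise. */
static int frame_decode(const uint8_t *in, size_t n, uint8_t *seq,
                        uint8_t *ack, uint8_t *payload, uint8_t *len)
{
    if (n < 5 || in[2] > MAX_PAYLOAD || n != 5u + in[2])
        return 0;
    uint16_t crc = crc16(in, n - 2);
    if (in[n - 2] != (uint8_t)(crc >> 8) || in[n - 1] != (uint8_t)(crc & 0xFF))
        return 0;
    *seq = in[0];
    *ack = in[1];
    *len = in[2];
    if (*len)
        memcpy(payload, in + 3, *len);
    return 1;
}

/* Receiver side: accept only the next expected number, so that a
   retransmitted duplicate is dropped. Returns 1 if the frame is new. */
static int receiver_accept(uint8_t *expected_seq, uint8_t seq)
{
    if (seq != *expected_seq)
        return 0;
    ++*expected_seq;
    return 1;
}
```

The sender keeps the last encoded frame and sends it again whenever no frame with a matching ack arrives before its timeout. The receiver replies with its current expected number in every frame, whether or not the incoming frame was new.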
Neither of the protocols has any kind of session support. But you can introduce one by simply adding another layer - a simple state machine and a timer:
session starts with a special packet
there should be at least one (potentially empty) packet within specified timeout
if this side hasn't sent a packet within the timeout/2, send an empty packet
if no packet has been seen from the other side of the communication within the timeout, the session has been terminated
one can use another special packet for graceful session termination
That is as simple as session control can get.
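The session rules above amount to a small state machine driven by a millisecond clock. A minimal sketch, in which all type and function names are hypothetical and the caller is assumed to supply the current time:

```c
#include <assert.h>
#include <stdint.h>

typedef enum { SESSION_IDLE, SESSION_OPEN, SESSION_CLOSED } session_state;
typedef enum { ACT_NONE, ACT_SEND_KEEPALIVE, ACT_SESSION_LOST } session_action;

typedef struct {
    session_state state;
    uint32_t timeout_ms;  /* maximum silence allowed from the peer */
    uint32_t last_rx_ms;  /* when we last saw any packet */
    uint32_t last_tx_ms;  /* when we last sent any packet */
} session;

/* Call when the special "start" packet has been sent or received. */
static void session_open(session *s, uint32_t now_ms, uint32_t timeout_ms)
{
    assert(timeout_ms > 0);
    s->state = SESSION_OPEN;
    s->timeout_ms = timeout_ms;
    s->last_rx_ms = now_ms;
    s->last_tx_ms = now_ms;
}

/* Call for every packet received, empty or not. */
static void session_on_rx(session *s, uint32_t now_ms) { s->last_rx_ms = now_ms; }

/* Call for every packet sent, empty or not. */
static void session_on_tx(session *s, uint32_t now_ms) { s->last_tx_ms = now_ms; }

/* Call periodically; tells the caller what the timer rules require.
   Unsigned subtraction keeps the arithmetic correct across clock wrap. */
static session_action session_tick(session *s, uint32_t now_ms)
{
    if (s->state != SESSION_OPEN)
        return ACT_NONE;
    if (now_ms - s->last_rx_ms >= s->timeout_ms) {
        s->state = SESSION_CLOSED;  /* peer silent for a full timeout */
        return ACT_SESSION_LOST;
    }
    if (now_ms - s->last_tx_ms >= s->timeout_ms / 2)
        return ACT_SEND_KEEPALIVE;  /* we have been quiet for timeout/2 */
    return ACT_NONE;
}
```

Graceful termination would simply set the state to SESSION_CLOSED when the second special packet is sent or received.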
There are (IMO) two aspects to this question.
Firstly, if data is being dropped then I'd look at resolving the hardware issues first, as otherwise you'll have GIGO.
As for the comms protocols, your post suggests a fairly trivial system? Are you wanting to validate data (parity, sumcheck?) or are you trying to include error correction?
If validation is all that is required, I've got reliable systems running using RS232 and CRC8 sumchecks - in which case this StackOverflow page probably helps
If some component is dropping data on a serial point-to-point link, there is probably a bug in your code.
Firstly, you should confirm that there is no problem with communication at the physical layer.
Secondly, you need some knowledge of data communication theory, such as ARQ (automatic repeat request).
Further thoughts, after considering your response to the first two answers... this does indicate hardware problems, and no amount of clever code is going to fix that.
I suggest you get an oscilloscope onto the link, which should help to determine where the fault lies. In particular look at the baud rate of the two sides (Tx, Rx) to ensure that they are within spec... auto-baud is often a problem?!
But look to see if drop out is regular, or can be sync-ed with any other activity.
On the sending side:
///////////////////////////////////////// XBee logging
void dataLog(int idx, int t, float f)
{
    ubyte stx[2] = { 0x10, 0x02 };
    ubyte etx[2] = { 0x10, 0x03 };
    nxtWriteRawHS(stx, 2, 1);
    wait1Msec(1);
    nxtWriteRawHS(idx, 2, 1);
    wait1Msec(1);
    nxtWriteRawHS(t, 2, 1);
    wait1Msec(1);
    nxtWriteRawHS(f, 4, 1);
    wait1Msec(1);
    nxtWriteRawHS(etx, 2, 1);
    wait1Msec(1);
}
On the receiving side:
void XBeeMonitorTask()
{
    int[] lastTick = Enumerable.Repeat<int>(int.MaxValue, 10).ToArray();
    int[] wrapCounter = new int[10];
    while (!XBeeMonitorEnd)
    {
        if (XBee != null && XBee.BytesToRead >= expectedMessageSize)
        {
            // read a data element, parse, add it to collection, see above for message format
            if (XBee.BaseStream.Read(XBeeIncoming, 0, expectedMessageSize) != expectedMessageSize)
                throw new InvalidProgramException();
            //System.Diagnostics.Trace.WriteLine(BitConverter.ToString(XBeeIncoming, 0, expectedMessageSize));
            // the frame is misaligned if either byte of the header or trailer is wrong
            if ((XBeeIncoming[0] != 0x10 || XBeeIncoming[1] != 0x02) || // dle stx
                (XBeeIncoming[10] != 0x10 || XBeeIncoming[11] != 0x03)) // dle etx
            {
                System.Diagnostics.Trace.WriteLine("recover sync");
                while (true)
                {
                    int b = XBee.BaseStream.ReadByte();
                    if (b == 0x10)
                    {
                        int c = XBee.BaseStream.ReadByte();
                        if (c == 0x03)
                            break; // realigned (maybe)
                    }
                }
                continue; // resume at loop start
            }
            UInt16 idx = BitConverter.ToUInt16(XBeeIncoming, 2);
            UInt16 tick = BitConverter.ToUInt16(XBeeIncoming, 4);
            Single val = BitConverter.ToSingle(XBeeIncoming, 6);
            if (tick < lastTick[idx])
                wrapCounter[idx]++;
            lastTick[idx] = tick;
            Dispatcher.BeginInvoke(DispatcherPriority.ApplicationIdle, new Action(() => DataAdd(idx, tick * wrapCounter[idx], val)));
        }
        Thread.Sleep(2); // surely we can keep up with the NXT
    }
}

Resources