I have a problem with my OpenCL code. I compile and run it on a CPU (Core 2 Duo) under Mac OS X 10.6.7. Here is the code:
#define BUFSIZE (524288) // 512 KB
#define BLOCKBYTES (32)  // 32 B

__kernel void test(__global unsigned char *in,
                   __global unsigned char *out,
                   unsigned int srcOffset,
                   unsigned int dstOffset) {
    int grId = get_group_id(0);
    unsigned char msg[BUFSIZE];
    srcOffset = grId * BUFSIZE;
    dstOffset = grId * BLOCKBYTES;
    // Copy from global to private memory
    size_t i;
    for (i = 0; i < BUFSIZE; i++)
        msg[i] = in[ srcOffset + i ];
    // Make some computation here, not complicated logic
    // Copy from private to global memory
    for (i = 0; i < BLOCKBYTES; i++)
        out[ dstOffset + i ] = msg[i];
}
The code gives me a runtime error, "Bus error". When I add a helper printf to the loop that copies from global to private memory, I can see where the error occurs; it happens on a different iteration of i every time. When I reduce BUFSIZE to 262144 (256 KB), the code runs fine. I tried using only one work-item per work-group. The *in pointer refers to a memory area holding thousands of KB of data. I suspected the private memory limit, but then I would expect the error to be thrown when the memory is allocated, not during the copy.
Here is my OpenCL device query:
---------------------------------
Device Intel(R) Core(TM)2 Duo CPU P7550 @ 2.26GHz
---------------------------------
CL_DEVICE_NAME: Intel(R) Core(TM)2 Duo CPU P7550 @ 2.26GHz
CL_DEVICE_VENDOR: Intel
CL_DRIVER_VERSION: 1.0
CL_DEVICE_VERSION: OpenCL 1.0
CL_DEVICE_TYPE: CL_DEVICE_TYPE_CPU
CL_DEVICE_MAX_COMPUTE_UNITS: 2
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_ITEM_SIZES: 1 / 1 / 1
CL_DEVICE_MAX_WORK_GROUP_SIZE: 1
CL_DEVICE_MAX_CLOCK_FREQUENCY: 2260 MHz
CL_DEVICE_ADDRESS_BITS: 32
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 1024 MByte
CL_DEVICE_GLOBAL_MEM_SIZE: 1535 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT: no
CL_DEVICE_LOCAL_MEM_TYPE: global
CL_DEVICE_LOCAL_MEM_SIZE: 16 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte
CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_IMAGE_SUPPORT: 1
CL_DEVICE_MAX_READ_IMAGE_ARGS: 128
CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 8
CL_DEVICE_SINGLE_FP_CONFIG: denorms INF-quietNaNs round-to-nearest
CL_DEVICE_IMAGE <dim> 2D_MAX_WIDTH 8192
2D_MAX_HEIGHT 8192
3D_MAX_WIDTH 2048
3D_MAX_HEIGHT 2048
3D_MAX_DEPTH 2048
CL_DEVICE_EXTENSIONS: cl_khr_fp64
cl_khr_global_int32_base_atomics
cl_khr_global_int32_extended_atomics
cl_khr_local_int32_base_atomics
cl_khr_local_int32_extended_atomics
cl_khr_byte_addressable_store
cl_APPLE_gl_sharing
cl_APPLE_SetMemObjectDestructor
cl_APPLE_ContextLoggingFunctions
CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t> CHAR 16, SHORT 8, INT 4, LONG 2, FLOAT 4, DOUBLE 2
You use a variable msg with a size of 512 kB. That variable lives in private memory, and private memory is not that big, so as far as I know this shouldn't work.
Why do you have the parameters srcOffset and dstOffset? You overwrite them right away, so their incoming values are never used.
I do not see any other issues. Try allocating local memory instead. Do you have a version of your code running without this optimization, i.e. one that just computes directly in global memory?
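If it helps, here is a minimal sketch of that last suggestion: a hypothetical rewrite that keeps only a 32-byte block in private memory and streams the 512 KB region straight from global memory. The unused offset parameters are dropped, and a XOR fold stands in for whatever your real computation is:
#define BUFSIZE (524288) // 512 KB per work-group, still read from global memory
#define BLOCKBYTES (32)  // 32 B

__kernel void test(__global const unsigned char *in,
                   __global unsigned char *out) {
    int grId = get_group_id(0);
    size_t srcOffset = (size_t)grId * BUFSIZE;
    size_t dstOffset = (size_t)grId * BLOCKBYTES;

    unsigned char msg[BLOCKBYTES];      // small private buffer instead of 512 KB
    for (size_t i = 0; i < BLOCKBYTES; i++)
        msg[i] = 0;

    // Walk the 512 KB region in 32-byte blocks, reading directly from global memory
    for (size_t block = 0; block < BUFSIZE; block += BLOCKBYTES)
        for (size_t i = 0; i < BLOCKBYTES; i++)
            msg[i] ^= in[srcOffset + block + i];  // placeholder for the real computation

    for (size_t i = 0; i < BLOCKBYTES; i++)
        out[dstOffset + i] = msg[i];
}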
I'm trying to stream audio from a QAudioInput to a QAudioOutput (Qt 5.15) but with a few seconds of artificial delay. In an effort to keep the code simple I implemented a delay filter based on a QIODevice, which sits between the input and output. The audio i/o is initialized like so:
QAudioFormat audioformat = ...;
QAudioInput *audioin = new QAudioInput(..., audioformat, this);
QAudioOutput *audioout = new QAudioOutput(..., audioformat, this);
DelayFilter *delay = new DelayFilter(this);
delay->setDelay(3.0, audioformat);
audioout->start(delay); // output reads from 'delay', writes to speakers
audioin->start(delay); // input reads from line input, writes to 'delay'
The delay filter is:
class DelayFilter : public QIODevice {
    Q_OBJECT
public:
    explicit DelayFilter (QObject *parent = nullptr);
    void setDelay (int bytes);
    void setDelay (double seconds, const QAudioFormat &format);
    int delay () const { return delay_; }
protected:
    qint64 readData (char *data, qint64 maxlen) override;
    qint64 writeData (const char *data, qint64 len) override;
private:
    int delay_;          // delay length in bytes
    QByteArray buffer_;  // buffered data for delaying
    int leadin_;         // >0 = need to increase output delay, <0 = need to decrease
    // debugging:
    qint64 totalread_, totalwritten_;
};
And implemented like this:
DelayFilter::DelayFilter (QObject *parent)
    : QIODevice(parent), delay_(0), leadin_(0), totalread_(0), totalwritten_(0)
{
    open(QIODevice::ReadWrite);
}

void DelayFilter::setDelay (double seconds, const QAudioFormat &format) {
    setDelay(format.bytesForFrames(qRound(seconds * format.sampleRate())));
}

void DelayFilter::setDelay (int bytes) {
    bytes = std::max(0, bytes);
    leadin_ += (bytes - delay_);
    delay_ = bytes;
}

qint64 DelayFilter::writeData (const char *data, qint64 len) {
    qint64 written = -1;
    if (len >= 0) {
        try {
            buffer_.append(data, len);
            written = len;
        } catch (const std::bad_alloc &) {
        }
    }
    if (written > 0) totalwritten_ += written;
    //qDebug() << "wrote " << written << leadin_ << buffer_.size();
    return written;
}

qint64 DelayFilter::readData (char *dest, qint64 maxlen) {
    //qDebug() << "reading" << maxlen << leadin_ << buffer_.size();
    qint64 w, bufpos;
    for (w = 0; leadin_ > 0 && w < maxlen; --leadin_, ++w)
        dest[w] = 0;
    for (bufpos = 0; bufpos < buffer_.size() && leadin_ < 0; ++bufpos, ++leadin_)
        ;
    // todo if needed: if upper limit is ok on buffered data, use a fixed size ring instead
    if (leadin_ == 0) {
        const char *bufdata = buffer_.constData();
        for ( ; bufpos < buffer_.size() && w < maxlen; ++bufpos, ++w)
            dest[w] = bufdata[bufpos];
        buffer_ = buffer_.mid(bufpos);
    }
    totalread_ += w;
    qDebug() << "read " << w << leadin_ << buffer_.size()
             << bufpos << totalwritten_ << totalread_ << (totalread_ - totalwritten_);
    return w;
}
Where the fundamental idea is:
If I want to delay 3 seconds, I write out 3 seconds of silence and then start piping the data as usual.
And delay changes are handled like this (for completeness, although it's not relevant to this question because I'm seeing issues without changing delays):
If I decrease that 3 second delay to 2 seconds then I have to skip 1 second worth of input to catch up.
If I increase that 3 second delay to 4 seconds then I have to write 1 second of silence to fall behind.
That is all implemented via the leadin_ counter, which contains the number of bytes I have to delay (> 0) or skip (< 0) to get to the desired delay. In my example case, it's set > 0 when the 3 second delay is configured, and that provides 3 seconds of silence to the QAudioOutput before it starts passing along the buffered data.
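(For concreteness: assuming the audio format here is 48 kHz, mono, 16-bit, which is what the numbers in the debug log below suggest, setDelay(3.0, format) works out to qRound(3.0 * 48000) frames * 2 bytes per frame = 288000 bytes of silence, and that matches the steady-state read-written difference of 288000 in the last column of the log.)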
The problem is that the delay is there when the app starts, but over the course of a few seconds it erodes until there is no delay at all. I can hear that it's skipping samples here and there to catch up, with an occasional light click or pop in the audio output.
The debug printouts show some things are working:
Seemingly matched read / write timings
Smooth decrease in leadin_ to 0 over first 3 seconds
Smooth increase in total bytes read and written
Constant bytes_read - bytes_written value, equal to the delay length in bytes, after the initial ramp up
But they also show some stuff isn't:
The buffer is filled by the QAudioInput and its size initially increases, but then it begins decreasing (once leadin_ is exhausted) and settles at a low value. I expected the buffer size to grow and then stay constant, equal to the delay length; the decrease means reads are happening faster than writes.
I can't make any sense of it. I added some debugging code to watch for state changes in the input / output to see if they were popping into Idle state (the output will do this to avoid buffer underruns) but they're not, they're just happily handling data with no apparent hiccups.
I expected this to work because both the input and output are using the same sample rate, and so once I initially get 3 seconds behind (or whatever delay time) I expected it to stay that way forever. I can't understand why, given that the input and output are configured at the same sample rate, the output is skipping samples and eating up the delay, and then playing smoothly again.
Am I missing some important override in my QIODevice implementation? Or is there some weird thing that Qt Multimedia does with audio buffering that is breaking this? Or am I just doing something fundamentally wrong here? Since this QIODevice-based delay is all very passive, I don't think I'm doing anything to drive the timing forward faster than it should be going, am I?
I hope this is clear; my definition of "read" and "write" kind of flip/flops above depending on context but I did my best.
Initial debug output (2nd number is leadin_, 3rd number is amount of buffered data, last number is read-written):
read 19200 268800 0 0 0 19200 19200
read 16384 252416 11520 0 11520 35584 24064
read 16384 236032 19200 0 19200 51968 32768
read 16384 219648 26880 0 26880 68352 41472
read 16384 203264 34560 0 34560 84736 50176
read 16384 186880 46080 0 46080 101120 55040
read 16384 170496 53760 0 53760 117504 63744
read 16384 154112 61440 0 61440 133888 72448
read 16384 137728 69120 0 69120 150272 81152
read 16384 121344 80640 0 80640 166656 86016
read 16384 104960 88320 0 88320 183040 94720
read 16384 88576 96000 0 96000 199424 103424
read 16384 72192 103680 0 103680 215808 112128
read 16384 55808 115200 0 115200 232192 116992
read 16384 39424 122880 0 122880 248576 125696
read 16384 23040 130560 0 130560 264960 134400
read 16384 6656 138240 0 138240 281344 143104
read 16384 0 140032 9728 149760 297728 147968
read 16384 0 131328 16384 157440 314112 156672
read 16384 0 122624 16384 165120 330496 165376
read 16384 0 113920 16384 172800 346880 174080
read 16384 0 109056 16384 184320 363264 178944
read 16384 0 100352 16384 192000 379648 187648
read 16384 0 91648 16384 199680 396032 196352
read 16384 0 82944 16384 207360 412416 205056
read 16384 0 78080 16384 218880 428800 209920
read 16384 0 69376 16384 226560 445184 218624
read 16384 0 60672 16384 234240 461568 227328
read 16384 0 51968 16384 241920 477952 236032
read 16384 0 47104 16384 253440 494336 240896
read 16384 0 38400 16384 261120 510720 249600
read 16384 0 29696 16384 268800 527104 258304
read 16384 0 20992 16384 276480 543488 267008
read 16384 0 16128 16384 288000 559872 271872
read 16384 0 7424 16384 295680 576256 280576
read 15104 0 0 15104 303360 591360 288000
read 7680 0 0 7680 311040 599040 288000
read 3840 0 0 3840 314880 602880 288000
read 3840 0 0 3840 318720 606720 288000
read 3840 0 0 3840 322560 610560 288000
read 3840 0 0 3840 326400 614400 288000
read 3840 0 0 3840 330240 618240 288000
read 3840 0 0 3840 334080 622080 288000
read 3840 0 0 3840 337920 625920 288000
I use OpenCL (under Ubuntu) to query the available platforms, which yields one platform, with:
CL_PLATFORM_PROFILE: FULL_PROFILE
CL_PLATFORM_VERSION: OpenCL 2.1 AMD-APP (3143.9)
CL_PLATFORM_NAME: AMD Accelerated Parallel Processing
CL_PLATFORM_VENDOR: Advanced Micro Devices, Inc.
Which offers one device, which I query with:
cl_device_id device = devices[ j ];
cl_uint units = -1;
cl_device_type type;
size_t lmem = -1;
cl_uint dims = -1;
size_t wisz[ 3 ];
size_t wgsz = -1;
size_t gmsz = -1;
err = clGetDeviceInfo( device, CL_DEVICE_NAME, sizeof(name), name, 0 );
err = clGetDeviceInfo( device, CL_DEVICE_NAME, sizeof(vend), vend, 0 );
err = clGetDeviceInfo( device, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(units), &units, 0 );
err = clGetDeviceInfo( device, CL_DEVICE_TYPE, sizeof(type), &type, 0 );
err = clGetDeviceInfo( device, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(lmem), &lmem, 0 );
err = clGetDeviceInfo( device, CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS, sizeof(dims), &dims, 0 );
err = clGetDeviceInfo( device, CL_DEVICE_MAX_WORK_ITEM_SIZES, sizeof(wisz), &wisz, 0 );
err = clGetDeviceInfo( device, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(wgsz), &wgsz, 0 );
CHECK_CL
err = clGetDeviceInfo( device, CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(gmsz), &gmsz, 0 );
CHECK_CL
if ( type == CL_DEVICE_TYPE_GPU )
device_id = device;
printf( " %s %s with [%d units] localmem=%zu globalmem=%zu dims=%d(%zux%zux%zu) max workgrp sz %zu", name, vend, units, lmem, gmsz, dims, wisz[0], wisz[1], wisz[2], wgsz );
Which gives me:
gfx1012 gfx1012 with [11 units] localmem=65536 globalmem=8573157376 dims=3(1024x1024x1024) max workgrp sz 256
The CL_DEVICE_MAX_COMPUTE_UNITS value of 11 worries me.
My system is equipped with the Radeon RX 5500 XT, which according to both AMDs website and Wikipedia is supposed to have 22 Compute Units.
Why does OpenCL report half the expected number, 11 Compute Units, instead of 22?
lspci reports:
19:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 14 [Radeon RX 5500/5500M / Pro 5500M] (rev c5) (prog-if 00 [VGA controller])
Subsystem: XFX Pine Group Inc. Navi 14 [Radeon RX 5500/5500M / Pro 5500M]
Flags: bus master, fast devsel, latency 0, IRQ 83, NUMA node 0
Memory at b0000000 (64-bit, prefetchable) [size=256M]
Memory at c0000000 (64-bit, prefetchable) [size=2M]
I/O ports at 7000 [size=256]
Memory at c5d00000 (32-bit, non-prefetchable) [size=512K]
Expansion ROM at c5d80000 [disabled] [size=128K]
Capabilities: <access denied>
Kernel driver in use: amdgpu
Kernel modules: amdgpu
And the AMD GPU PRO driver was installed.
OpenGL vendor string: Advanced Micro Devices, Inc.
OpenGL renderer string: Radeon RX 5500 XT
OpenGL core profile version string: 4.6.14752 Core Profile Context 20.30
OpenGL core profile shading language version string: 4.60
For AMD RDNA GPUs, OpenCL reports the number of dual compute units in CL_DEVICE_MAX_COMPUTE_UNITS (see the RDNA whitepaper, pages 4-9). Each dual compute unit contains 2 compute units, as the name suggests. So your hardware and driver installation are fine.
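As a quick sanity check of the numbers: 11 reported dual compute units * 2 = 22 compute units, and 22 CUs * 64 stream processors per CU = 1408 stream processors, which is exactly what AMD lists for the RX 5500 XT.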
I am experimenting with a Vivante GC2000 Series GPU, where clinfo produced the result below:
CL_DEVICE_GLOBAL_MEM_SIZE: 64 MByte
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 32 MByte
CL_DEVICE_GLOBAL_MEM_CACHE_TYPE: Read/Write
CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE: 64
CL_DEVICE_GLOBAL_MEM_CACHE_SIZE: 4096
CL_DEVICE_LOCAL_MEM_SIZE: 1 KByte
CL_DEVICE_LOCAL_MEM_TYPE: Global
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 4 KByte
CL_DEVICE_MAX_CONSTANT_ARGS: 9
From the above output, it is clear that 64 MByte is the limit for global memory.
Now, when I tried allocating 900 MBytes of global memory, I did not receive any error and the allocation succeeded:
int noOfBytes = (900 * 1024 * 1024);
memPtr = clCreateBuffer(context, CL_MEM_READ_WRITE, noOfBytes, NULL, &err);
if ( err != CL_SUCCESS) {
    printf ("Ooops.. Failed");
}
It seems this experiment disproves what clinfo claims. Am I missing some theory, or something else?
Because buffers and images are allocated on an OpenCL context (not an OpenCL device) the actual device allocation is often deferred until the buffer is used on a specific device. So while this allocation seemed to work, if you try to actually use that buffer on your device, you'll get an error.
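A minimal sketch of how you could see that deferred allocation fail (assuming a command queue called queue and reusing memPtr, noOfBytes and err from your snippet) is to actually touch the buffer on the device, for example with a blocking write, and check the returned error:
char byte = 0;
// A blocking 1-byte write near the end of the buffer typically forces the runtime
// to back the whole allocation on the device; expect something like
// CL_MEM_OBJECT_ALLOCATION_FAILURE or CL_OUT_OF_RESOURCES if 900 MB cannot be satisfied.
err = clEnqueueWriteBuffer(queue, memPtr, CL_TRUE,
                           noOfBytes - 1, 1, &byte, 0, NULL, NULL);
if (err != CL_SUCCESS) {
    printf("Deferred allocation failed with error %d\n", err);
}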
I have an Nvidia Tesla K80 sitting in a Linux box. I know that internally a Tesla K80 has two GPUs. When I run an OpenCL program on that machine, looping over all the devices, I see four devices (four Tesla K80s). Would you know why this could be happening?
Here is the host code:
ret = clGetPlatformIDs(0, NULL, &platformCount); openclCheck(ret);
platforms = (cl_platform_id*) malloc(sizeof(cl_platform_id) * platformCount);
ret = clGetPlatformIDs(platformCount, platforms, NULL); openclCheck(ret);
printf("Detect %d platform available.\n", platformCount);
for (unsigned int i = 0; i < platformCount; i++) {
    // get all devices
    ret = clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_GPU, 0, NULL, &deviceCount); openclCheck(ret)
    devices = (cl_device_id*) malloc(sizeof(cl_device_id) * deviceCount);
    ret = clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_GPU, deviceCount, devices, NULL); openclCheck(ret)
    printf("Platform %d. %d device available.\n", i+1, deviceCount);
    // for each device print critical attributes
    for (unsigned int j = 0; j < deviceCount; j++) {
        // print device name
        ret = clGetDeviceInfo(devices[j], CL_DEVICE_NAME, 0, NULL, &valueSize); openclCheck(ret)
        value = (char*) malloc(valueSize);
        ret = clGetDeviceInfo(devices[j], CL_DEVICE_NAME, valueSize, value, NULL); openclCheck(ret)
        printf("\t%d. Device: %s\n", j+1, value);
        free(value);
        //more code here to print device attributes
Here is the output:
Detect 1 platform available.
Platform 1. 4 device available.
1. Device: Tesla K80
1.1 Hardware version: OpenCL 1.2 CUDA
1.2 Software version: 352.79
1.3 OpenCL C version: OpenCL C 1.2
1.4 Parallel compute units: 13
2. Device: Tesla K80
2.1 Hardware version: OpenCL 1.2 CUDA
2.2 Software version: 352.79
2.3 OpenCL C version: OpenCL C 1.2
2.4 Parallel compute units: 13
3. Device: Tesla K80
3.1 Hardware version: OpenCL 1.2 CUDA
3.2 Software version: 352.79
3.3 OpenCL C version: OpenCL C 1.2
3.4 Parallel compute units: 13
4. Device: Tesla K80
4.1 Hardware version: OpenCL 1.2 CUDA
4.2 Software version: 352.79
4.3 OpenCL C version: OpenCL C 1.2
4.4 Parallel compute units: 13
Most probably two of them are 32-bit implementations and two are 64-bit implementations, coming from multiple installed drivers. Maybe old drivers need to be cleaned out with a display driver uninstaller tool. Please check the bitness of each device implementation.
Or, there are virtual GPU (GRID?) services left active causing duplicated devices, so you could try deactivating the virtual GPU to solve this.
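For the bitness check, a minimal sketch (reusing the devices, deviceCount, ret and openclCheck from the host code above) would query CL_DEVICE_ADDRESS_BITS for each device:
for (unsigned int j = 0; j < deviceCount; j++) {
    cl_uint addressBits = 0;
    ret = clGetDeviceInfo(devices[j], CL_DEVICE_ADDRESS_BITS,
                          sizeof(addressBits), &addressBits, NULL); openclCheck(ret)
    printf("\t%d. Device address bits: %u\n", j + 1, addressBits);
}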
We created a small program to detect the Xeon Phi; here is our code snippet:
std::vector<cl::Platform> platformList(5);
std::vector<cl::Device> deviceList;
cl::Platform::get(&platformList);
if (platformList.size() == 0) {
    std::cout << "No Platforms found. Check OpenCL installation!" << std::endl;
    exit(1);
}
for (i = 0; i < platformList.size(); i++) {
    // for(i=0; i<1; i++){
    std::cout << platformList[i].getInfo<CL_PLATFORM_NAME>() << std::endl;
    platformList[i].getDevices(CL_DEVICE_TYPE_ALL, &deviceList);
    if (deviceList.size() == 0) {
        std::cout << "No Devices found. Check OpenCL installation!" << std::endl;
        exit(1);
    }
    for (j = 0; j < deviceList.size(); j++) {
        // dims = deviceList[j].getInfo<CL_DEVICE_MAX_WORK_ITEM_SIZES>();
        // for(k=0; k<dims.size(); k++)
        //     std::cout << dims[k] << std::endl;
        std::cout << deviceList[j].getInfo<CL_DEVICE_NAME>() << std::endl;
    }
}
cl::Device device = deviceList[j-1];
std::cout << "Using device: " << device.getInfo<CL_DEVICE_NAME>() << std::endl;
but it does not detect the Phi; we get only this output:
Intel(R) OpenCL
Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
Using device: Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
Hello World
Do you know what we are doing wrong?
P.S. Below you can find the micinfo output:
MicInfo Utility Log
Copyright 2011-2013 Intel Corporation All Rights Reserved.
Created Thu Oct 2 15:04:08 2014
System Info
HOST OS : Linux
OS Version : 2.6.32-431.el6.x86_64
Driver Version : 3.2-1
MPSS Version : 3.2
Host Physical Memory : 16274 MB
Device No: 0, Device Name: mic0
Version
Flash Version : 2.1.02.0390
SMC Firmware Version : 1.16.5078
SMC Boot Loader Version : 1.8.4326
uOS Version : 2.6.38.8+mpss3.2
Device Serial Number : ADKC32800437
Board
Vendor ID : 0x8086
Device ID : 0x225d
Subsystem ID : 0x3608
Coprocessor Stepping ID : 2
PCIe Width : Insufficient Privileges
PCIe Speed : Insufficient Privileges
PCIe Max payload size : Insufficient Privileges
PCIe Max read req size : Insufficient Privileges
Coprocessor Model : 0x01
Coprocessor Model Ext : 0x00
Coprocessor Type : 0x00
Coprocessor Family : 0x0b
Coprocessor Family Ext : 0x00
Coprocessor Stepping : C0
Board SKU : C0PRQ-3120/3140 P/A
ECC Mode : Enabled
SMC HW Revision : Product 300W Active CS
Cores
Total No of Active Cores : 57
Voltage : 0 uV
Frequency : 1100000 kHz
Thermal
Fan Speed Control : On
Fan RPM : 1200
Fan PWM : 20
Die Temp : 45 C
GDDR
GDDR Vendor : Elpida
GDDR Version : 0x1
GDDR Density : 2048 Mb
GDDR Size : 5952 MB
GDDR Technology : GDDR5
GDDR Speed : 5.000000 GT/s
GDDR Frequency : 2500000 kHz
GDDR Voltage : 1501000 uV
You might want to look at https://software.intel.com/en-us/articles/opencl-runtime-release-notes. It is more recent than the page Cicada pointed you to and provides a link to Intel® OpenCL™ Runtime 14.2.
The libmic_device.so is included with the OpenCL runtime and is, by default, in /opt/intel/opencl{version_number}/lib64. You will want to make sure that path is in your LD_LIBRARY_PATH environment variable. You will also want to make sure that /opt/intel/opencl{version_number}/mic is in your MIC_LD_LIBRARY_PATH environment variable.
You already have the Intel MPSS installed; otherwise micinfo would not work. The libcoi_host.so is included in the MPSS and installs in /usr/lib64, which is already in your library search path.
The version of the MPSS that you are running is 3.2-1. The "What's new" notes for the OpenCL runtime 14.1 on the release notes web page say that version 14.1 is unstable under MPSS 3.2-1. I am trying to find out if there is a different version of the runtime you can use with MPSS 3.2-1 that is more stable, or if the only recommendation is to install a newer MPSS. You can find the latest MPSS releases at https://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-mpss.
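Once the runtime and the library paths are in place, a quick way to verify that the Phi becomes visible is to query accelerator devices explicitly. This is only a sketch, assuming the Intel runtime exposes the coprocessor as CL_DEVICE_TYPE_ACCELERATOR and that cl.hpp is used without exceptions enabled, as in your snippet:
// Hypothetical check: list only accelerator devices on each platform.
std::vector<cl::Platform> platforms;
cl::Platform::get(&platforms);
for (size_t p = 0; p < platforms.size(); ++p) {
    std::vector<cl::Device> accels;
    // getDevices returns an error code (e.g. CL_DEVICE_NOT_FOUND) when no such device exists
    if (platforms[p].getDevices(CL_DEVICE_TYPE_ACCELERATOR, &accels) == CL_SUCCESS) {
        for (size_t d = 0; d < accels.size(); ++d)
            std::cout << "Accelerator: " << accels[d].getInfo<CL_DEVICE_NAME>() << std::endl;
    }
}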